By Gilles Blanchard, Olivier Bousquet and Pascal Massart
Fraunhofer-Institute FIRST, Google and Université Paris-Sud



The Annals of Statistics 

2008, Vol. 36, No. 2, 489-531 

DOI: 10.1214/009053607000000839 

© Institute of Mathematical Statistics, 2008 

STATISTICAL PERFORMANCE OF SUPPORT VECTOR MACHINES

The support vector machine (SVM) algorithm is well known to the computer
learning community for its very good practical results. The goal of the
present paper is to study this algorithm from a statistical perspective,
using tools of concentration theory and empirical processes.

Our main result builds on the observation made by other authors that the
SVM can be viewed as a statistical regularization procedure. From this point
of view, it can also be interpreted as a model selection principle using a
penalized criterion. It is then possible to adapt general methods related to
model selection in this framework to study two important points: (1) what is
the minimum penalty and how does it compare to the penalty actually used in
the SVM algorithm; (2) is it possible to obtain "oracle inequalities" in that
setting, for the specific loss function used in the SVM algorithm? We show
that the answer to the latter question is positive and provides relevant
insight to the former. Our result shows that it is possible to obtain fast
rates of convergence for SVMs.

1. Introduction. The success of the support vector machine (SVM) algorithm
for pattern recognition is probably mainly due to the number of remarkable
experimental results that have been obtained in very diverse domains of
application. The algorithm itself can be written as a nice convex
optimization problem for which there exists a unique optimum, except in
rare degenerate cases. It can also be expressed as the minimization of a
regularized functional where the regularizer is the squared norm in a Hilbert
space of functions on the input space. Although these are nice mathematical
formulations, quite amenable to analysis, the statistical behavior of this
algorithm remains only partially understood. Our goal in this work is to
investigate the properties of the SVM algorithm in a statistical setting.

1.1. The abstract classification problem and convex loss approximation. 
We consider a generic (binary) classification problem, defined by the
following setting: assume that the product $\mathcal{X} \times \mathcal{Y}$ is a
measurable space endowed with an unknown probability measure $P$, where
$\mathcal{Y} = \{-1, 1\}$ and $\mathcal{X}$ is called the input space. The pair
$(X, Y)$ denotes a random variable with values in $\mathcal{X} \times \mathcal{Y}$
distributed according to $P$. We denote by $P_X$ the marginal distribution of
the variable $X$. We observe a set of $n$ independent and identically
distributed (i.i.d.) pairs $(X_i, Y_i)_{i=1}^n$ sampled according to $P$. These
random variables form the training set.

Given this sample, the goal of the classification task is to estimate the
Bayes classifier, that is, the measurable function $s^*$ from $\mathcal{X}$ to
$\mathcal{Y}$ which minimizes the probability of misclassification, also called
generalization error, $\mathcal{E}(s^*) = P[s^*(X) \neq Y]$. It is easily shown
that $s^*(x) = 2 \cdot \mathbb{1}\{P(Y = 1 | X = x) > \frac{1}{2}\} - 1$ a.s. on the
set $\{P(Y = 1 | X = x) \neq \frac{1}{2}\}$. Note that it is an abuse to call $s^*$
"the" minimizer of the misclassification error, since it can have arbitrary
value on the set $\{P(Y = 1 | X = x) = \frac{1}{2}\}$. In the sequel, we refer to
$s^*$ as a fixed function, for example, by arbitrarily choosing $s^*$ to be 1 on
the latter set.
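
As a minimal sketch of the definition above, assume the conditional probability $P(Y = 1 | X = x)$ were available as a number (in practice it is unknown). The function names and the argument `p_pos` are hypothetical, introduced only for illustration; the tie-breaking value 1 follows the arbitrary choice made in the text.

```python
def bayes_classifier(p_pos: float) -> int:
    """Bayes classifier s*(x), given p_pos = P(Y = 1 | X = x).

    On the tie set {p_pos = 1/2} the value is arbitrary; 1 is used here,
    matching the convention chosen in the text.
    """
    return 1 if p_pos >= 0.5 else -1


def conditional_error(label: int, p_pos: float) -> float:
    """P(label != Y | X = x) for a predicted label in {-1, 1}."""
    return 1.0 - p_pos if label == 1 else p_pos


for p in (0.1, 0.5, 0.8):
    s_star = bayes_classifier(p)
    # The Bayes classifier attains the smaller of the two conditional errors.
    assert conditional_error(s_star, p) == min(p, 1.0 - p)
```

Averaging `conditional_error(bayes_classifier(p), p)` over the distribution of $X$ would give the smallest achievable generalization error.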

Having a finite sample from $P$, a seemingly reasonable procedure is to find
a classifier $s$ minimizing the empirical classification error
$\mathcal{E}_n(s) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{s(X_i) \neq Y_i\}$, with the
minimization performed over some model of controlled complexity. However,
this is in most cases intractable in practice because it is not a convex
optimization procedure. This is the reason why a number of actual
classification algorithms replace this loss by a convex loss over some
real-valued (instead of $\{-1, 1\}$-valued) function spaces. This is the case of
the SVM where such a "proxy" loss is used ensuring convexity properties.
Its relation with the classification loss will be detailed in Section 2.1.
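
The empirical classification error is simply the fraction of misclassified training pairs. The sketch below is illustrative only: the one-dimensional toy sample and the threshold classifier are hypothetical and do not come from the paper.

```python
from typing import Callable, Sequence, Tuple

Sample = Sequence[Tuple[float, int]]  # pairs (X_i, Y_i) with Y_i in {-1, 1}


def empirical_error(s: Callable[[float], int], sample: Sample) -> float:
    """Fraction of training pairs (X_i, Y_i) with s(X_i) != Y_i."""
    return sum(1 for x, y in sample if s(x) != y) / len(sample)


def threshold_classifier(x: float) -> int:
    # Hypothetical classifier: predict 1 when x is positive.
    return 1 if x > 0 else -1


toy_sample = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.5, 1), (0.7, -1)]
# Only the last pair is misclassified by the threshold rule.
toy_error = empirical_error(threshold_classifier, toy_sample)
```

Because this quantity is a count, it is piecewise constant in the parameters of the classifier, which is one way to see why its minimization is not a convex problem.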

1.2. Motivations. 

Relative loss and oracle-type inequalities. In the last two decades of the 
last century, the theoretical study of various classification algorithms has 
mainly focused on deriving confidence intervals about their generalization 
error. The foundations of this theory have been laid down by Vapnik and 
Chervonenkis as early as 1971 [38]. Such confidence intervals have been derived
for SVMs and, more generally, so-called "large margin classifiers," for
example, using the notion of fat-shattering VC dimension; see [2]. 

However, it is probably fair to say that the explicit confidence intervals
thus obtained are never sharp enough to be of practical interest, even
though effort, legitimately, has been and is still made to obtain tighter
bounds. On the other hand, we argue that uniform confidence intervals
about the generalization error are not the best-suited tool for correctly
understanding the behavior of the algorithm.

If we compare the classification setting to regression, we see that, in
regression, the loss of an estimator is always measured relative to a target
function $f^*$ (e.g., through $L_2$ distance). Furthermore, recent work (see, e.g.,
[22]) has shown that a precise study of the behavior of the relative loss when
the estimator $\hat{f}$ is close to $f^*$ is a key element for proving correct
convergence rates. This approach is sometimes called "localization."

In this paper we follow this general principle in the context of SVMs. 
Our main quantity of interest will therefore be the relative loss, for the 
proxy loss function, of $s$ with respect to $s^*$, instead of the absolute loss itself
(the average relative loss will also be called risk). In this regard, this work 
should be put in the context of a general trend in the recent literature on 
classification and, more generally, statistical learning, where the focus has 
shifted to the relative loss (see also below Section 5.1.2 for further discussion 
on this point). 

Of course, a confidence interval for the relative loss is not informative, 
since s* is unknown; instead, the goal to be aimed at is an oracle-type 
inequality. The term oracle inequality originally refers to a risk bound for 
a model selection procedure where the bound is within a constant factor of 
the risk of a minimax estimator in the best model; that is, almost as good
as if this best model had been known in advance through an "oracle". In 
the present context, we use more loosely the term "oracle-type inequality" 
to designate a bound where the risk of the estimator can be compared to 
the risk of the best approximating functions coming from any model under 
consideration plus a model-dependent penalty term; this without knowing in 
advance which models are best. This approach typically allows us to obtain 
precise bounds on the rates of convergence toward the target function. 

SVM and regularized model selection. It has been noted by several au- 
thors (see Section 2.3) that SVMs can be seen as a regularized estimation 
method, where the regularizer is the squared norm of the estimating func- 
tion in some reproducing kernel Hilbert space. We show that this can also 
be interpreted as a penalized model selection method, where the models are 
balls in this Hilbert space. This allows us to cast the SVM problem into a 
general penalized model selection framework, where we are able to use tools 
developed in [22], in order to obtain oracle-type inequalities over the family 
of considered models. 

1.3. Highlights of the present work. 

A generic, versatile model selection theorem. To be applied to SVMs, 
the results of [22] need to be extended to a setting where various parameters 
are model-dependent, resulting in various technical problems. Therefore, we
decided to devote a whole section (Section 4) of this paper to the extension of 
these model selection results in a very general setting. We believe this result 
is of much interest per se because it can be useful for other applications 
(at least when the loss function is bounded model-wise) and constitutes an 
important point of this work. 

Is the SVM an adaptive procedure? The application of the above general 
result to SVMs is an example of the power of this approach, and allows us to 
derive a nonasymptotic oracle-type inequality for the SVM proxy risk. This 
is the main result of this paper. The interesting feature of oracle-type bounds
is that they display adaptivity properties: while the regularization term used 
in the estimator does not depend on assumptions on the target function, 
the bound itself involves the approximation properties of the models to the 
target function. Therefore, the (fixed) estimation procedure "adapts" to how 
well the target is approximated by the models. This is in contrast to other 
related work on the subject such as [12, 32], where typically the optimal 
bound is obtained for a choice of the regularization constant that depends 
a priori on these approximation properties. 

Is the SVM regularization function adequate? Our result allows us to cast 
a new light on a very interesting problem, namely, concerning the adequate 
regularization function to be used in the SVM setting. Our main theorem 
establishes that the oracle-type inequality holds provided the regularizer
function is larger than some lower bound $C(\|f\|_k, n)$ which is a function of
the Hilbert norm $\|f\|_k$ and the sample size $n$. Since the oracle inequality
bound is nondecreasing as a function of the regularization term, choosing the
regularization precisely equal to $C(\|f\|_k, n)$ will result in the best possible
bound allowed by our analysis. The precise behavior, as a function of the
sample size $n$, of $C(\|f\|_k, n)$ depends on a capacity analysis of the kernel
Hilbert space. For this, we provide two possible routes, either using the 
spectrum of the kernel integral operator, or the supremum norm entropy of 
the kernel space. In particular, we show (in both situations) that, while the 
squared Hilbert norm is traditionally used as a regularizer for the SVM, a 
linear function of the Hilbert norm is enough to ensure the oracle inequality: 
this suggests that the traditional regularizer could indeed be too heavy. 

Using several kernels. Another interesting consequence of the model se- 
lection approach is that it is possible to derive almost transparently an 
oracle-type inequality in an extended situation where we use several kernels 
at once for the SVM. Namely, the different kernels can be compared via 
their respective penalized empirical losses. The oracle inequality then states 
that this amounts to selecting the best kernel available for the problem. 

Influence of the generating probability on the convergence rate. It has
been recently pointed out (see [23, 35]) that in the classification setting, the
behavior of the function $\eta(x) = P[Y = 1 | X = x]$ in the neighborhood of the
value $\frac{1}{2}$ plays a crucial role in the optimal convergence rate toward the
Bayes classifier. In this paper we assume that $\eta(x)$ is bounded away from the
value $\frac{1}{2}$ by a "gap" $\eta_0$ and study the influence of $\eta_0$ on the risk
bounds obtained. An interesting feature of the result is that the knowledge of
$\eta_0$ is not needed to define the estimator itself: it only comes into play
through a remainder term in the bound.

Note that, for a strictly convex proxy loss, this type of assumption on $\eta$
essentially influences the relation between classification risk and proxy risk
(see [4]), while it has no impact on the statistical behavior of the proxy risk
itself. Because the proxy loss used by the SVM is not strictly convex (it is 
piecewise linear), the setting considered in the present paper is different: the 
gap assumption plays a role directly in the inequalities for the proxy risk 
and not in the relation with the classification risk. 

1.4. Organization of the paper. In Section 2 we present the SVM algo- 
rithm, show how to formulate it as a model selection via penalization method 
and survey existing results. In Section 3 we state the main result of the paper 
for the SVM and discuss its implications and scope. The main tool to derive 
these results, which handle penalized model selection in a generic setting, 
is given in Section 4 — we hope that its generality will make it useful in the 
future for other settings as well. We subsequently show how to apply this 
general result to the special case of the SVM. Section 5 contains a compar- 
ison of our result to other related work and concluding remarks. Finally, 
Section 6 contains the proofs of the results. 

2. Support vector machines. For details about the algorithm, its basic 
properties and various extensions, we refer to the books [13, 29, 37]. We give 
here a short presentation of the formulation of the algorithm with emphasis 
on the fact that it can be thought of as a model selection via penalization 
method. 

2.1. Preliminaries: loss functions. With some abuse of notation, we denote
$Pg := E[g(X, Y)]$ for an integrable function $g$ from $\mathcal{X} \times \mathcal{Y}$
to $\mathbb{R}$. Also, we introduce the empirical measure defined by the sample as

$$P_n := \frac{1}{n} \sum_{i=1}^n \delta_{(X_i, Y_i)},$$

so that $P_n g$ denotes $\frac{1}{n} \sum_{i=1}^n g(X_i, Y_i)$. Finally, we denote
$\eta(x) = P[Y = 1 | X = x]$.

Before we delve further into the details of the support vector machine, we
want to establish a few general preliminaries useful to understand the goals 
of the rest of the paper. 

The natural setting to study SVMs is real-valued classification where we
build estimators $f_n$ of $s^*$ as real-valued functions, being understood that the
actual binary classifier associated to a real function is obtained by taking its
sign. We therefore measure the probability of misclassification by comparing
the sign of $f_n(X)$ to $Y$, thus rewriting the generalization error as

$$\mathcal{E}(f_n) = P[Y f_n(X) < 0] = E[\theta(Y f_n(X))],$$

where $\theta(z) = \mathbb{1}\{z < 0\}$ is called the 0-1 loss function. By a slight
abuse of notation, we also denote by $\theta$ the following functional:

$$\theta(f) := (x, y) \mapsto \mathbb{1}\{y f(x) < 0\}.$$

We define the associated risk (or relative average loss) function

$$\Theta(f_n, s^*) := P[Y f_n(X) < 0] - P[Y s^*(X) < 0] = P\theta(f_n) - P\theta(s^*).$$

However, as will appear in the next section, the classification error
$\theta(\cdot)$ is not the actual measure of fit used by the algorithm of the
support vector machine; it uses instead the "hinge loss" function defined by
$\ell(z) := (1 - z)_+$, where $(\cdot)_+$ denotes the positive part. Similarly, we
also denote by $\ell$ the following functional:

$$\ell(f) := (x, y) \mapsto (1 - y f(x))_+;$$

the associated risk function is denoted

$$L(f_n, s^*) := E[\ell(f_n) - \ell(s^*)].$$

As mentioned in the introduction, using this convex loss allows for a tractable 
optimization problem for actual implementation of the algorithm. Since $\ell$ is
the loss function actually used to build the SVM classifier, the aim of our 
analysis is to derive oracle inequalities about its associated risk L. 
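
To make the two losses concrete, the following sketch evaluates the 0-1 loss and the hinge loss on the margins $Y_i f(X_i)$ of a small sample. The sample and the function `f_linear` are hypothetical; the 0-1 loss uses the strict inequality exactly as written above.

```python
from typing import Callable, Sequence, Tuple

Sample = Sequence[Tuple[float, int]]  # pairs (X_i, Y_i) with Y_i in {-1, 1}


def zero_one_loss(z: float) -> float:
    """theta(z) = 1{z < 0}, applied to the margin z = y * f(x)."""
    return 1.0 if z < 0 else 0.0


def hinge_loss(z: float) -> float:
    """l(z) = (1 - z)_+, the positive part of 1 - z."""
    return max(0.0, 1.0 - z)


def empirical_risk(loss: Callable[[float], float],
                   f: Callable[[float], float],
                   sample: Sample) -> float:
    """P_n loss(f) = (1/n) * sum_i loss(Y_i * f(X_i))."""
    return sum(loss(y * f(x)) for x, y in sample) / len(sample)


def f_linear(x: float) -> float:
    # Hypothetical real-valued decision function.
    return 2.0 * x


toy_sample = [(-1.0, -1), (-0.2, -1), (0.1, 1), (0.4, -1)]
risk_01 = empirical_risk(zero_one_loss, f_linear, toy_sample)
risk_hinge = empirical_risk(hinge_loss, f_linear, toy_sample)
```

Note that the hinge loss also charges correctly classified points whose margin is below 1, so on any sample the empirical hinge risk is at least the empirical 0-1 risk.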

However, as the main goal of classification is ultimately to obtain low
generalization error $\mathcal{E}$, it is only natural to ask the question of the
connection between the two above losses. It is obvious that
$\theta(x) \le \ell(x)$ and therefore that $\mathcal{E}(f) \le E[\ell(f)]$.
Nevertheless, recalling our main focus is on risks (i.e., relative average
loss), this remark is not really satisfactory and the two following additional
questions are of primary interest:

• How is the real-valued function $f^*$ minimizing the averaged hinge loss
$E[\ell(f^*)]$ related to the optimal classifier $s^*$?

• How are $\Theta(\cdot, \cdot)$ and $L(\cdot, \cdot)$ related?

(Again, note that it is not entirely correct to talk about "the" function $f^*$
minimizing the hinge loss, since it is not unique: in the sequel we will assume
a specific choice has been made.)

The following elementary lemma gives a satisfactory answer to these ques- 
tions: 

Lemma 2.1. (i) Let $s^*$ be a minimizer of $\mathcal{E}(s)$ over all measurable
functions $s$ from $\mathcal{X}$ into $\{-1, 1\}$. Then the following holds:

$$E[\ell(s^*)] = \min_f E[\ell(f)],$$

where the right-hand side minimum is taken over all measurable real-valued
functions on $\mathcal{X}$. Furthermore, if $f^*$ is a minimizer of $E[\ell(f)]$, then
$f^* = s^*$ a.s. on the set $\{P[Y = 1 | X = x] \notin \{0, \frac{1}{2}, 1\}\}$.
(ii) For any $P$-measurable function $f$,

$$\Theta(f, s^*) \le L(f, s^*).$$

Part (i) of the lemma can be found in [19] and part (ii) in [40], but we
give a self-contained proof in Section 6.1 for completeness. Since the choice
of $f^*$ is arbitrary among minimizers of $E[\ell(f)]$, (i) implies that we can
choose $f^* = s^*$, which will be assumed from now on.
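
Both parts of Lemma 2.1 can be spot-checked pointwise, conditionally on $X = x$, since the risks are averages of conditional risks. The sketch below is a numerical illustration on a grid, not a proof; it assumes the 0-1 loss $\mathbb{1}\{z < 0\}$ as written in Section 2.1, and all function names are hypothetical.

```python
def cond_hinge_risk(f: float, eta: float) -> float:
    """E[l(Y f) | X = x] when P(Y = 1 | X = x) = eta and f(x) = f."""
    return eta * max(0.0, 1.0 - f) + (1.0 - eta) * max(0.0, 1.0 + f)


def cond_zero_one_risk(f: float, eta: float) -> float:
    """E[theta(Y f) | X = x] with theta(z) = 1{z < 0}."""
    return eta * (1.0 if f < 0 else 0.0) + (1.0 - eta) * (1.0 if f > 0 else 0.0)


def bayes_label(eta: float) -> float:
    """s*(x), with the arbitrary value 1 on the tie set eta = 1/2."""
    return 1.0 if eta >= 0.5 else -1.0


def excess_risks(f: float, eta: float):
    """Conditional versions of Theta(f, s*) and L(f, s*)."""
    s_star = bayes_label(eta)
    excess_01 = cond_zero_one_risk(f, eta) - cond_zero_one_risk(s_star, eta)
    excess_hinge = cond_hinge_risk(f, eta) - cond_hinge_risk(s_star, eta)
    return excess_01, excess_hinge


GRID = [k / 4.0 for k in range(-12, 13)]   # candidate values of f(x) in [-3, 3]
ETAS = [k / 10.0 for k in range(0, 11)]    # candidate values of eta(x)


def lemma_holds(tol: float = 1e-12) -> bool:
    for eta in ETAS:
        best_hinge = min(cond_hinge_risk(f, eta) for f in GRID)
        # Part (i): s* attains the smallest conditional hinge risk on the grid.
        if cond_hinge_risk(bayes_label(eta), eta) > best_hinge + tol:
            return False
        # Part (ii): excess 0-1 risk is dominated by excess hinge risk.
        for f in GRID:
            excess_01, excess_hinge = excess_risks(f, eta)
            if excess_01 > excess_hinge + tol:
                return False
    return True
```

Calling `lemma_holds()` scans every pair of grid values and reports whether both statements hold there.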

2.2. The SVM algorithm. There are several possible ways of formulating 
the SVM algorithm. Historically, it was formulated geometrically. First sup- 
pose the input space X is a Hilbert vector space and that the two classes can 
be separated by a hyperplane. The SVM classifier is then the linear classi- 
fier obtained by finding the hyperplane which separates the training points 
in the two classes with the largest margin (maximal margin hyperplane). 
The margin corresponds to the smallest distance from a data point to the 
hyperplane. 
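
As an illustration of the geometric formulation, the margin of a candidate hyperplane can be computed directly in a finite-dimensional Euclidean space (a special case of a Hilbert space). The toy sample and the two candidate normal vectors below are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def geometric_margin(w: Point, b: float,
                     sample: Sequence[Tuple[Point, int]]) -> float:
    """Smallest signed distance Y_i * (<w, X_i> + b) / ||w|| over the sample.

    The value is positive exactly when the hyperplane separates the classes.
    """
    norm_w = math.sqrt(sum(c * c for c in w))
    return min(
        y * (sum(wc * xc for wc, xc in zip(w, x)) + b) / norm_w
        for x, y in sample
    )


toy_sample = [((2.0, 1.0), 1), ((3.0, -1.0), 1),
              ((-2.0, 0.5), -1), ((-4.0, 2.0), -1)]

margin_a = geometric_margin((1.0, 0.0), 0.0, toy_sample)  # hyperplane x_1 = 0
margin_b = geometric_margin((1.0, 1.0), 0.0, toy_sample)  # hyperplane x_1 + x_2 = 0
```

Both hyperplanes separate the toy sample, but the first one does so with a larger margin; the maximal margin hyperplane is the one for which this quantity is largest.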

Now, in general, X may not be a Hilbert space, but is mapped into one 
where the above algorithm is applied. For computational tractability of the 
algorithm, it is crucial that this Hilbert space can be generated by a (repro- 
ducing) kernel, whose properties we sum up briefly here. 

Assume we have at hand a so-called kernel function
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, meaning that $k$ is symmetric and
positive semi-definite, in the following sense:

$$\forall n,\ \forall (x_1, \ldots, x_n) \in \mathcal{X}^n,\ \forall (a_1, \ldots, a_n) \in \mathbb{R}^n \qquad \sum_{i,j=1}^n a_i a_j k(x_i, x_j) \ge 0.$$

It can be proved that such a function defines a unique reproducing kernel
Hilbert space (RKHS for short) $\mathcal{H}_k$ of real-valued functions on
$\mathcal{X}$. Namely, define $\mathcal{H}_k$ as the completion of
$\mathrm{span}\{k(x, \cdot) : x \in \mathcal{X}\}$, with respect to the norm induced by
the following inner product:

$$\langle u, v \rangle_k = \sum_{i=1}^n \sum_{j=1}^m a_i b_j k(x_i, x'_j) \quad \text{for } u = \sum_{i=1}^n a_i k(x_i, \cdot) \text{ and } v = \sum_{j=1}^m b_j k(x'_j, \cdot);$$

here the completion is defined in such a way that it consists of real
functions on $\mathcal{X}$ as announced. We denote the norm in $\mathcal{H}_k$ by
$\|\cdot\|_k$.

Since $\mathcal{H}_k$ is a Hilbert space of real-valued functions on $\mathcal{X}$,
any element $w$ of $\mathcal{H}_k$ can be alternatively understood as a vector or as
a function. Moreover, this space has the so-called reproducing property which
can be expressed as

$$\forall u \in \mathcal{H}_k,\ \forall x \in \mathcal{X} \qquad u(x) = \langle u, k(x, \cdot) \rangle_k.$$

Finally, as announced, the input space $\mathcal{X}$ is mapped into $\mathcal{H}_k$ by
the simple mapping $x \mapsto k(x, \cdot)$, and, thus, the scalar product of the
images of $x, x' \in \mathcal{X}$ in $\mathcal{H}_k$ is just given by $k(x, x')$.
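
The inner product and the reproducing property can be checked numerically on finite kernel expansions. The sketch below assumes a Gaussian kernel on the real line purely as an example of a symmetric positive semi-definite kernel; the paper does not fix a particular kernel, and the names `Expansion`, `evaluate` and `inner` are hypothetical.

```python
import math
from typing import List, Tuple

# A finite expansion sum_i a_i k(x_i, .) stored as pairs (a_i, x_i).
Expansion = List[Tuple[float, float]]


def gaussian_kernel(x: float, y: float, bandwidth: float = 1.0) -> float:
    """Gaussian kernel on the real line (an illustrative choice of k)."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def evaluate(u: Expansion, x: float) -> float:
    """Pointwise value u(x) = sum_i a_i k(x_i, x)."""
    return sum(a * gaussian_kernel(xi, x) for a, xi in u)


def inner(u: Expansion, v: Expansion) -> float:
    """<u, v>_k = sum_i sum_j a_i b_j k(x_i, x'_j)."""
    return sum(a * b * gaussian_kernel(xi, xj) for a, xi in u for b, xj in v)


u = [(1.0, -1.0), (-2.0, 0.0), (0.5, 2.0)]
x0 = 0.7

reproduced = inner(u, [(1.0, x0)])  # <u, k(x0, .)>_k
pointwise = evaluate(u, x0)         # u(x0)
squared_norm = inner(u, u)          # ||u||_k^2, nonnegative by positivity of k
```

The agreement of `reproduced` and `pointwise` is the reproducing property for this expansion, and the nonnegativity of `squared_norm` is the positive semi-definiteness condition with coefficients $(a_i)$.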

Now, in that space, a hyperplane is defined by its normal vector $w$ and a
threshold $b \in \mathbb{R}$ as

$$H(w, b) = \{v \in \mathcal{H}_k : \langle w, v \rangle_k + b = 0\}.$$

It is easy to see [29] that the maximum margin hyperplane (when it exists)
is given by the solution of the following optimization problem:

$$\min_{w \in \mathcal{H}_k,\, b \in \mathbb{R}} \frac{1}{2} \|w\|_k^2$$

under the constraints: $\forall i = 1, \ldots, n$, $Y_i(\langle w, k(X_i, \cdot) \rangle_k + b) \ge 1$.

However, it can happen that the data is not linearly separable (i.e., the above
constraints define an empty set). This has led to considering the following
relaxed optimization problem, depending on some constant $C > 0$:

(2.1) $$\min_{w \in \mathcal{H}_k,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \frac{1}{2} \|w\|_k^2 + C \sum_{i=1}^n \xi_i$$

under the constraints: $\forall i = 1, \ldots, n$, $Y_i(\langle w, k(X_i, \cdot) \rangle_k + b) \ge 1 - \xi_i$;
$\forall i = 1, \ldots, n$, $\xi_i \ge 0$.

This problem always has a solution and is usually referred to as the
soft-margin SVM. It is common, although not systematic, for theoretical studies
of SVMs to introduce a simpler version of the SVM algorithm where one
uses only hyperplanes containing the origin, that is, $b$ is set to zero (although
this version is admittedly rarely used in practice). This is mainly to avoid
some technical difficulties. We will adopt this simplification here, calling
this constrained version "SVM$_0$," and we will focus on it for the main part
of the paper.

2.3. From regularization to model selection. It has been noticed by several
authors [15, 30] that the soft-margin SVM algorithm can be formulated
as the minimization of a regularized functional. Consider the primal
optimization problem (2.1). For a fixed $w$, obviously the optimal choice for the
parameters $(\xi_i)$ given the constraints is
$\xi_i = (1 - Y_i(\langle w, k(X_i, \cdot) \rangle_k + b))_+$. Now using the reproducing
property of the kernel, we have $\langle w, k(X_i, \cdot) \rangle_k = w(X_i)$, so the new
formulation of the problem is (now denoting $f$ instead of $w$)

(2.2) $$\min_f \frac{1}{n} \sum_{i=1}^n (1 - Y_i f(X_i))_+ + \Lambda_n \|f\|_k^2,$$

where $\Lambda_n = \frac{1}{2Cn}$ and the minimum is to be performed over
$f \in \mathcal{H}_k$ (for the SVM$_0$ algorithm) and for
$f \in \mathcal{H}_k^b = \{x \mapsto g(x) + b \mid g \in \mathcal{H}_k, b \in \mathbb{R}\}$ for the
plain SVM algorithm. Note that $\|\cdot\|_k$, inherited from $\mathcal{H}_k$ to
$\mathcal{H}_k^b$, is only a semi-norm on $\mathcal{H}_k^b$.
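
The passage from (2.1) to (2.2) is a rescaling: plugging the optimal slacks into the objective of (2.1) and dividing by $Cn$ gives the objective of (2.2) with $\Lambda_n = 1/(2Cn)$. The sketch below checks this identity for one fixed function; the margin values $Y_i f(X_i)$, the squared norm and the constant $C$ are hypothetical numbers chosen for illustration.

```python
from typing import List


def optimal_slacks(margins: List[float]) -> List[float]:
    """xi_i = (1 - Y_i f(X_i))_+, the smallest slack allowed by (2.1)."""
    return [max(0.0, 1.0 - m) for m in margins]


def soft_margin_objective(sq_norm: float, slacks: List[float], C: float) -> float:
    """(1/2) ||w||_k^2 + C * sum_i xi_i, the objective of (2.1)."""
    return 0.5 * sq_norm + C * sum(slacks)


def regularized_objective(sq_norm: float, margins: List[float], lam: float) -> float:
    """(1/n) sum_i (1 - Y_i f(X_i))_+ + Lambda_n ||f||_k^2, as in (2.2)."""
    n = len(margins)
    return sum(max(0.0, 1.0 - m) for m in margins) / n + lam * sq_norm


# Hypothetical values of Y_i f(X_i) and of ||f||_k^2 for one fixed function f.
margins = [1.7, 0.4, -0.3, 1.0, 0.9]
sq_norm = 2.5
C = 3.0

n = len(margins)
lam = 1.0 / (2.0 * C * n)  # Lambda_n obtained by dividing (2.1) by C * n
primal_value = soft_margin_objective(sq_norm, optimal_slacks(margins), C)
rescaled_value = primal_value / (C * n)
regularized_value = regularized_objective(sq_norm, margins, lam)
```

Since the two objectives differ only by the positive factor $Cn$, they have the same minimizers; a large $C$ corresponds to a small regularization constant $\Lambda_n$.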

Now, it is straightforward that the optimization problem (2.2) can be
rewritten in the following way:

(2.3) $$\min_{R \in \mathbb{R}_+} \left( \min_{f :\, \|f\|_k \le R} \frac{1}{n} \sum_{i=1}^n (1 - Y_i f(X_i))_+ + \Lambda_n R^2 \right).$$

This gives rise to the interpretation of the above regularization as model
selection, where the models are balls in $\mathcal{H}_k$ (or "semi-norm balls" in
$\mathcal{H}_k^b$), and where the model selection is done using penalized empirical
loss minimization. Also, it is now clear from equations (2.2) and (2.3) that the
empirical loss used by the SVM is not the classification error (or 0-1 loss
function), but the hinge loss function $\ell$ defined in the previous section.
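
The equivalence between the regularized form (2.2) and the nested, model-selection form (2.3) can be illustrated on a finite set of candidate functions, each summarized by its empirical hinge loss and its norm. This is only a sketch under that simplifying assumption (the actual minimization runs over the whole space), and the candidate values are hypothetical.

```python
from typing import List, Tuple

Candidate = Tuple[float, float]  # (empirical hinge loss of f, norm ||f||_k)


def direct_minimum(candidates: List[Candidate], lam: float) -> float:
    """min_f [loss(f) + lam * ||f||_k^2], as in (2.2)."""
    return min(loss + lam * norm ** 2 for loss, norm in candidates)


def nested_minimum(candidates: List[Candidate], lam: float) -> float:
    """min_R [min_{||f||_k <= R} loss(f) + lam * R^2], as in (2.3).

    Over a finite candidate set the inner minimum only changes when R
    crosses one of the candidate norms, so scanning those radii suffices.
    """
    best = float("inf")
    for radius in sorted({norm for _, norm in candidates}):
        inner = min(loss for loss, norm in candidates if norm <= radius)
        best = min(best, inner + lam * radius ** 2)
    return best


toy_candidates = [(1.0, 0.0), (0.6, 1.0), (0.3, 2.0), (0.25, 4.0)]
direct_value = direct_minimum(toy_candidates, lam=0.1)
nested_value = nested_minimum(toy_candidates, lam=0.1)
```

Each radius $R$ plays the role of a model (a ball), the inner minimum is empirical loss minimization within that model, and $\Lambda_n R^2$ is the penalty attached to it.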

Denoting $B(R)$ the ball of $\mathcal{H}_k$ of radius $R$, our interest in the main
part of the paper is to study the behavior of SVM$_0$ vis-à-vis the family of
models $B(R)$, and the correct order of the regularization function to be used.

3. Main result. 

3.1. Assumptions. We will present two variations of our main result. The
difference between the two versions is in the way the capacity of the RKHS
is analyzed. General assumptions on the RKHS $\mathcal{H}_k$ and on the generating
distribution are common to the two versions. Below we denote
$\eta(x) = P(Y = 1 | X = x)$.

(A1) $\mathcal{H}_k$ is a separable space (note that the separability of
$\mathcal{H}_k$ is ensured, in particular, if $\mathcal{X}$ is a compact topological
space and $k$ is continuous on $\mathcal{X} \times \mathcal{X}$), and
$k(x, x) \le M^2 < \infty$ for all $x \in \mathcal{X}$.

(A2) ("Low noise" condition) $\forall x \in \mathcal{X}$, $|\eta(x) - \frac{1}{2}| \ge \eta_0$.

The following additional assumption will be required only for setting (S1)
below:

(A3) $\forall x \in \mathcal{X}$, $\min(\eta(x), 1 - \eta(x)) \ge \eta_1$.

Our result covers the two following possible settings: 

Setting 1 (S1). Suppose assumptions (A1), (A2) and (A3) are satisfied. In
this first setting, the capacity of the RKHS is analyzed through the spectral
properties of the kernel integral operator $L_k : L_2(P_X) \to L_2(P_X)$ defined as

(3.1) $$(L_k f)(x) = \int k(x, x') f(x') \, dP_X(x'),$$

which is positive, self-adjoint and trace-class (see Appendix A for details).
As a result, $L_k$ can be diagonalized in an orthogonal basis of $L_2(P_X)$, it has
discrete spectrum $\lambda_1 \ge \lambda_2 \ge \cdots$ (where the eigenvalues are
repeated with their multiplicities) and satisfies $\sum_{j>0} \lambda_j < \infty$.
For a fixed $\delta > 0$, we then define for $n \in \mathbb{N}$ the following function:

$$\gamma(n) = \frac{1}{\sqrt{n}} \inf_{d \in \mathbb{N}} \left( \frac{d}{\sqrt{n}} + \frac{\eta_1}{M} \sqrt{\sum_{j>d} \lambda_j} \right).$$
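
To give a feeling for how a spectral capacity term of this kind behaves, the sketch below evaluates a trade-off between a dimension $d$ and the tail sum of eigenvalues beyond $d$. Two points are assumptions and should be read as such: the functional form $\gamma(n) = n^{-1/2} \inf_d \big( d/\sqrt{n} + (\eta_1/M) \sqrt{\sum_{j>d} \lambda_j} \big)$ is taken as the working definition, and the eigenvalue sequence is a hypothetical, finitely truncated geometric sequence (in practice the eigenvalues of $L_k$ are unknown).

```python
import math
from typing import Sequence


def gamma_spectral(eigenvalues: Sequence[float], n: int,
                   eta_1: float, M: float) -> float:
    """Assumed form of the spectral capacity term:

        (1 / sqrt(n)) * min_d ( d / sqrt(n) + (eta_1 / M) * sqrt(sum_{j>d} lambda_j) ),

    with d ranging over 0, 1, ..., len(eigenvalues) (truncated spectrum).
    """
    lam = sorted(eigenvalues, reverse=True)
    root_n = math.sqrt(n)
    # tail[d] = sum of the eigenvalues lambda_j with (1-based) index j > d.
    tail = [0.0] * (len(lam) + 1)
    for d in range(len(lam) - 1, -1, -1):
        tail[d] = tail[d + 1] + lam[d]
    best = min(
        d / root_n + (eta_1 / M) * math.sqrt(tail[d])
        for d in range(len(lam) + 1)
    )
    return best / root_n


# Hypothetical spectrum with geometric decay, lambda_j = 2^{-j}, j = 1..20.
toy_eigenvalues = [2.0 ** -j for j in range(1, 21)]
gamma_small_n = gamma_spectral(toy_eigenvalues, n=100, eta_1=0.25, M=1.0)
gamma_large_n = gamma_spectral(toy_eigenvalues, n=10_000, eta_1=0.25, M=1.0)
```

With a fast-decaying spectrum the minimizing $d$ grows slowly with $n$, so the capacity term shrinks quickly as the sample size increases.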

Setting 2 (S2). Suppose assumptions (A1) and (A2) are satisfied. For the
second situation covered by the theorem, the capacity is measured via
supremum norm covering numbers. In this situation, we assume that the RKHS
$\mathcal{H}_k$ can be included via a compact injection into $C(\mathcal{X})$ and we
denote by $H_\infty(B_{\mathcal{H}_k}, \varepsilon)$ the $\varepsilon$-entropy number
(log-covering number) in the supremum norm of the unit ball of $\mathcal{H}_k$.
Denote

(3.2) $$\xi(x) = \int_0^x \sqrt{H_\infty(B_{\mathcal{H}_k}, \varepsilon)} \, d\varepsilon,$$

and let $x_*(n)$ be the solution of the equation $\xi(x) = M^{-1} n^{1/2} x^2$. For a
fixed $\delta > 0$, define for $n \in \mathbb{N}$ the following function:

$$\gamma(n) = M^{-2} x_*^2(n).$$

3.2. Statement. We now state our main result, which applies, in particular,
to the SVM$_0$ algorithm.

Theorem 3.1. Consider either setting (S1) under assumptions (A1),
(A2) and (A3), or setting (S2) under assumptions (A1) and (A2). Define
the constant $w_1 = \eta_1$ for setting (S1) and $w_1 = 1$ for setting (S2).

Let $\delta > 0$ be a fixed real number; and let $\Lambda_n > 0$ be a real number
satisfying

(3.3) $$\Lambda_n \ge c \left( \gamma(n) + w_1^{-1} \frac{\log(\delta^{-1} \log n) \vee 1}{n} \right),$$

where $c$ is a universal constant. Finally, let $\varphi$ be a nondecreasing function
on $\mathbb{R}_+$ such that $\varphi(0) = 0$ and $\varphi(x) \ge x$ for $x \ge \frac{1}{2}$.

Consider the following regularized minimum empirical loss procedure on
an i.i.d. sample $((X_i, Y_i))_{i=1,\ldots,n}$ from distribution $P$, using the hinge
loss function $\ell(x, y) = (1 - xy)_+$:

(3.4) $$\hat{g} = \operatorname*{Arg\,Min}_{g \in \mathcal{H}_k} \left( \frac{1}{n} \sum_{i=1}^n \ell(g(X_i), Y_i) + \Lambda_n \varphi(M \|g\|_k) \right);$$

then if $s^*$ denotes the Bayes classifier, the following bound holds with
probability at least $1 - \delta$:

(3.5) $$L(\hat{g}, s^*) \le 2 \inf_{g \in \mathcal{H}_k} \left[ L(g, s^*) + 2 \Lambda_n \varphi(2M \|g\|_k) \right] + 4 \Lambda_n \left( 2\varphi(2) + c\, w_1 \eta_0^{-1} \right).$$

3.3. Discussion and comments. 

3.3.1. Discussion of the result. 

Adaptivity of the SVM. The most important point we would like to stress 
about Theorem 3.1 is that the regularization term and the final bound are 
independent of any assumption on how well the target function $f^*$ is
approximated by functions in $\mathcal{H}_k$. This is an important advantage in the
approach we advocate here, that is, casting regularization as model selection.
The model selection approach dictates a minimal order of the regularization,
which is "structural" in the sense that it depends on some complexity
measure of the models (here balls of $\mathcal{H}_k$) and not on how well the models
approximate the target. In simpler terms, the minimal regularizer depends
only on the estimation error, not the approximation error. Our result is
therefore an oracle type bound, which entails that the SVM is an adaptive
procedure with respect to the approximation properties of the target by
functions in $\mathcal{H}_k$. From this bound, we can derive convergence rates to Bayes
as soon as we have an additional hypothesis on these approximation prop- 
erties, while the procedure stays unchanged. We discuss this point in more 
detail in Section 3.4. 

Squared versus linear regularization. The second point we want to empha- 
size about Theorem 3.1 is that the minimum regularization function required 
to ensure that the oracle inequality holds is of order $\|g\|_k$ only (as a function
of $\|g\|_k$). In the original SVM algorithm, a regularization of order $\|g\|_k^2$ is
used. The theorem covers both situations by choosing respectively $\varphi(x) = x$
or $\varphi(x) = 2x^2$. In view of the oracle inequality, the weaker the regularization
term, the better the upper bound: provided that the oracle inequality holds, 
a weaker regularization will grant a better bound on the convergence rate. 
Therefore, this theorem suggests that [under certain conditions, i.e., mainly 
(A2)] a lighter regularization can be used instead of the standard, quadratic, 
one. 
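
The two choices of $\varphi$ mentioned above can be checked against the conditions of Theorem 3.1 ($\varphi(0) = 0$, $\varphi$ nondecreasing, $\varphi(x) \ge x$ for $x \ge 1/2$) and compared as penalties. The sketch below is a grid-based spot check with hypothetical function names and hypothetical values of $\Lambda_n$, $M$ and $\|g\|_k$.

```python
from typing import Callable


def phi_linear(x: float) -> float:
    """phi(x) = x: regularization linear in the Hilbert norm."""
    return x


def phi_quadratic(x: float) -> float:
    """phi(x) = 2 x^2: the standard squared-norm regularization."""
    return 2.0 * x * x


def satisfies_theorem_conditions(phi: Callable[[float], float],
                                 grid_size: int = 2000,
                                 x_max: float = 10.0) -> bool:
    """Spot check on a grid of [0, x_max]: phi(0) = 0, phi nondecreasing,
    and phi(x) >= x whenever x >= 1/2."""
    xs = [x_max * k / grid_size for k in range(grid_size + 1)]
    values = [phi(x) for x in xs]
    nondecreasing = all(a <= b for a, b in zip(values, values[1:]))
    dominates = all(v >= x for x, v in zip(xs, values) if x >= 0.5)
    return phi(0.0) == 0.0 and nondecreasing and dominates


def penalty(lam_n: float, M: float, norm_g: float,
            phi: Callable[[float], float]) -> float:
    """Regularization term Lambda_n * phi(M * ||g||_k) appearing in (3.4)."""
    return lam_n * phi(M * norm_g)


linear_ok = satisfies_theorem_conditions(phi_linear)
quadratic_ok = satisfies_theorem_conditions(phi_quadratic)
```

The factor 2 in $2x^2$ is exactly what makes the quadratic choice dominate $x$ from $x = 1/2$ onward; for functions of large norm the linear penalty is much smaller than the quadratic one.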

Of course, while a lighter regularization results in a better bound in our
theorem, we cannot assert positively that the resulting algorithm will nec- 
essarily outperform the standard one: to draw such a conclusion, we would 
need a corresponding lower bound for the standard algorithm. Here we will 
merely point out the analogy of SVM to regularized least squares regression. 
Under a Gaussian noise assumption, the behavior of the regularized least 
squares estimator of the form (3.4) [with the square loss $\ell(x, y) = (x - y)^2$
replacing the hinge loss] is completely elucidated (see [24], Section 4.4). In 
particular, the standard quadratic regularization estimator has an explicit 
form, from which it is relatively simple to derive corresponding lower bounds. 
As a consequence, in that case, it can be proven that a regularization that is 
lighter than quadratic enjoys better adaptivity properties than the standard 
one. In the present work, we have followed essentially the same driving ideas 
to derive our main result in the SVM setting, so that there is reasonable 
hope that the obtained bound indeed reflects the behavior of the algorithm. 
A complete proof of that fact is an interesting open issue. 

From hinge loss risk to classification risk. This theorem relates the relative
hinge loss $E[\ell(\hat{g}) - \ell(s^*)]$ (where $s^*$ is the Bayes classifier) to the
optimum relative loss in the models considered, that is, balls of $\mathcal{H}_k$ (see
Section 2.3). Furthermore, Lemma 2.1 ensures that the relative classification error
is upper-bounded by the relative hinge loss error, hence, the theorem also 
results in a bound on the relative classification error. 

3.3.2. Discussion of the assumptions. 

About assumption (A2). This assumption requires that the conditional
probability of $Y$ given $X$ should be bounded away from $\frac{1}{2}$ by a "gap"
$\eta_0$. Note that the knowledge of $\eta_0$ is not necessary for the definition of
the estimator, as it does not enter in the regularization term. This quantity only
appears as an additional term in the oracle inequality (3.5). Furthermore, for
$\eta_0$ not depending on $n$, this trailing term will become negligible as
$n \to \infty$, since the infimum in the first term will be attained for a function
$g_n \in \mathcal{H}_k$ with $\|g_n\|_k \to \infty$ (see below Section 3.4). Assumption (A2)
is a particular case of the so-called Tsybakov's noise condition, which is known
to be a crucial factor for determining fast minimax rates in classification
problems (see [23, 35]).

A possible generalization. A more general Tsybakov's noise condition
would be to assume, in place of (A2), that $|\frac{1}{2} - \eta(x)|^{-1} \in L_p$ for some
$p > 0$. In this setting, it is possible to show (although it is out of the scope
of the present work) that a result similar to (3.5) holds, with the same
regularization function, except that the trailing term in (3.5) of order
$\eta_0^{-1} \Lambda_n$ gets replaced by a term of the form $\zeta(\Lambda_n)$, with
$x \le \zeta(x) \le \sqrt{x}$, where the exact form of $\zeta$ depends on the noise
condition and the structural complexity analysis of $\mathcal{H}_k$. Obviously, in this
general situation, the trailing term is no longer necessarily negligible;
whether or not this is the case will depend on the behavior of the first term
of the bound, and therefore on the approximation properties of $f^*$ by
$\mathcal{H}_k$. The interpretation of this generalization is therefore more involved.

About assumption (A3). The requirement that $\eta$ should be bounded away
from 0 and 1 by a gap $\eta_1$ is a technical assumption in setting (S1) needed as a
quid pro quo for obtaining an explicit relation between regularization term
and eigenvalues (see the short discussion before Theorem 6.6 in Section 6.3).
While there does not appear to be an intrinsic reason for this assumption, we
did not succeed in getting rid of it in this setting. Note that, in contrast to
the previous point, the knowledge of $\eta_1$ is needed to define the regularization
explicitly in this setting. While this assumption is somewhat unsatisfactory,
it is possible, at least in principle, to obtain an explicit lower bound on the
value of $\eta_1$ by introducing deliberately in the data a small artificial "label
flipping noise" (i.e., flipping a small proportion of the training labels). We
refer to [9] (in the discussion preceding Corollary 10 there; the idea also
appeared earlier in [39]) where this idea is exposed in more detail. Note that
the label flipping preserves assumption (A2), albeit with a smaller gap value
$\eta_0$.
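
The effect of artificial label flipping on the two gaps can be worked out explicitly under one simple model, which is an assumption of this sketch and not necessarily the exact scheme of [9]: each label is flipped independently of everything else with a fixed probability $p < 1/2$. The conditional probability then becomes $(1 - p)\eta + p(1 - \eta)$, which stays at distance at least $p$ from 0 and 1, while its distance to $1/2$ shrinks by the factor $1 - 2p$.

```python
def flipped_eta(eta: float, p: float) -> float:
    """P(Y' = 1 | X = x) after flipping each label independently with
    probability p, when P(Y = 1 | X = x) = eta."""
    return (1.0 - p) * eta + p * (1.0 - eta)


def gap_from_half(eta: float) -> float:
    """|eta - 1/2|, the quantity bounded below by eta_0 in (A2)."""
    return abs(eta - 0.5)


def gap_from_edges(eta: float) -> float:
    """min(eta, 1 - eta), the quantity bounded below by eta_1 in (A3)."""
    return min(eta, 1.0 - eta)


flip_probability = 0.1  # hypothetical flipping rate
for eta in (0.0, 0.05, 0.3, 0.8, 1.0):
    eta_flipped = flipped_eta(eta, flip_probability)
    # (A3) now holds with eta_1 = p, whatever the original eta was.
    assert gap_from_edges(eta_flipped) >= flip_probability - 1e-12
    # The gap in (A2) is preserved up to the factor (1 - 2p).
    shrunk = (1.0 - 2.0 * flip_probability) * gap_from_half(eta)
    assert abs(gap_from_half(eta_flipped) - shrunk) < 1e-12
```

Under this model, a distribution satisfying (A2) with gap $\eta_0$ satisfies, after flipping, (A2) with gap $(1 - 2p)\eta_0$ and (A3) with $\eta_1 = p$.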

About setting (S1). An unsatisfactory part of the result for setting (S1) is
that it is not possible to compute the value of the regularization parameter
$\gamma(n)$ from the data, since it requires knowledge of the eigenvalues of $L_k$.
The interest of this setting is to give an idea of what the relevant quantities
are for defining a suitable regularization, in a way that is generally more
precise than for setting (S2) (see discussion in the next section). Moreover,
there is strong hope that estimating these tail sums of eigenvalues from the
data (using, e.g., techniques from [3]) would lead to a suitable data-dependent
penalty.

3.3.3. Other comments. 

Multiplicative constant. The constant 2 in front of the right-hand side of 
equation (3.5) could be made arbitrarily close to one at the price of increasing 
the regularization function accordingly. Here we made an arbitrary choice 
in order to simplify the result. 

Deviation inequality vs. average risk. The above result states a deviation
bound valid with high probability $1 - \delta$. Note that $\delta$ enters into the
regularization function; hence, it is not possible to directly integrate (3.5) to state
a bound for the average risk. However, it is possible to obtain such a result
at the price of a slightly heavier regularization (an additional logarithmic
factor). Namely, the proof of Theorem 3.1 essentially relies on a general model
selection theorem (Theorem 4.3 in the next section) which covers both the
deviation inequalities and average risk inequalities with minor changes in
the penalty function. For brevity, we do not state here the resulting theorem
obtained for average SVM performance, but it should be clear that only
minor modifications to the proof of Theorem 3.1 would be necessary.

Using several kernels at once. Suppose we have several different kernels
$k_1, \dots, k_t$ at hand. Then we can adapt the theorem to use them simultaneously.
Namely, to each kernel is associated a penalization constant $\Lambda_n$;
the estimator $\hat{g}$ is given by (3.4) where we add another Argmin operation
over the kernel index; and oracle inequality (3.5) is valid with an additional
minimum over the kernel index; only $\delta$ has to be replaced by $\delta/t$ for the
price of the union bound. That such a result holds is straightforward when
one takes a look at the model selection approach used to prove Theorem 3.1
(developed in Section 4). This is one of the advantages of this approach.

3.4. Penalty functions and convergence rates for support vector machines. 

3.4.1. Convergence rates for the SVM. Let us first note from the definition
of $\gamma$ in both settings (S1) and (S2) that, generally, $\gamma(n)$ is of order
lower than $n^{-1/2}$. This is in contrast with some earlier results in learning
theory where bounds and associated penalties often behave like $n^{-1/2}$. Actual
rates of convergence to the Bayes classifier also depend on the behavior
of the bias (or approximation error) term $\inf_{\|g\|_k \le R} L(g, s^*)$. In most practical
cases, the functions in $\mathcal{H}_k$ are continuous, while the Bayes classifier
is not; hence, the Bayes classifier cannot belong to any of the models. If
we assume that $\mathcal{H}_k$ is dense in $L_1(P)$, however (see also the stronger notion
of "universal kernel" in [31]), then there exists a sequence of functions
$(g_n) \in \mathcal{H}_k$ such that $u_n = L(g_n, s^*) \to 0$, implying consistency of the SVM.
Moreover, if information is available about the speed of approximation [i.e.,
how $\inf_{\|g\|_k \le R} L(g, s^*)$ goes to zero as a function of $R$] and about the function
$\gamma(n)$ [depending either on eigenvalues or supremum norm entropy according
to setting (S1) or (S2)], an upper bound on the speed of convergence of the
estimator can be derived from Theorem 3.1. As noted earlier, in this case,
using a regularization term of order $\|g\|_k$ instead of $\|g\|_k^2$ always leads to
a better upper bound on the convergence rate. The study of such approximation
rates for special function classes is outside the scope of the present
paper, but is an interesting future direction.

3.4.2. About the function $\gamma(n)$ in settings (S1) and (S2). The behavior,
as a function of the norm $\|g\|_k$, of the minimum regularization function
required in the theorem does not depend on the setting. Its behavior as a
function of the sample size $n$, however, does, since the complexity analysis
differs between the two settings.

In order to fix ideas, we give here a very classical Sobolev space type
example where we can explicitly compute the function $\gamma$ in both settings,
and where they coincide. Let us consider the case where $\mathcal{X} = \mathbb{T}$ is the unit
circle, the marginal $P_X$ of the observations is the Lebesgue measure, and
the reproducing kernel $k$ is translation invariant, $k(x, y) = \tilde{k}(x - y)$, where $\tilde{k}$
is a periodic function that admits the Fourier series decomposition

$\tilde{k}(z) = \sum_{k \ge 0} a_k \cos(2\pi k z)$,

where $(a_k)$ is a sequence of nonnegative numbers. Obviously, the Fourier
basis forms a basis of eigenvectors for the associated integral operator $L_k$
and the eigenvalues are $\lambda_1 = a_0$, $\lambda_{2k} = \lambda_{2k+1} = a_k/2$ for $k > 0$. A function
belonging to the RKHS, $f \in \mathcal{H}_k$, is therefore characterized by
$\sum_{k > 0} \lambda_k^{-1} f_k^2 = \|f\|_k^2 < \infty$, where $f_k$ are its Fourier coefficients.

Consider the case where $\lambda_k \lesssim k^{-2s}$ for some $s > \frac{1}{2}$. Then computing the
function $\gamma$ in setting (S1) yields $\gamma_1(n) \lesssim n^{-2s/(2s+1)}$. On the other hand,
clearly $\mathcal{H}_k$ can be continuously included into the Sobolev space $H^s(\mathbb{T})$. Uniform
norm entropy estimates for Sobolev spaces have been established (and
can be traced back to [7]; see also [14], page 105 for a general result); it is
known that $H_\infty(B_{H^s(\mathbb{T})}, \varepsilon) \lesssim \varepsilon^{-1/s}$; hence, the function $\xi$ appearing in
setting (S2) is such that $\xi(x) \lesssim x^{(2s-1)/(2s)}$, leading also to $\gamma_2(n) \lesssim n^{-2s/(2s+1)}$.
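
The rate $n^{-2s/(2s+1)}$ can be checked numerically under an explicit assumption. The sketch below assumes, purely for illustration, that the eigenvalue-based complexity term has the form $\gamma(n) = n^{-1/2} \min_d (d/\sqrt{n} + \sqrt{\sum_{j>d} \lambda_j})$ with all constants set to 1; this form is not restated in the present section, and the function names `gamma_eigen` and `empirical_rate` are hypothetical. It then estimates the log-log slope of $\gamma(n)$ for $\lambda_j = j^{-2s}$.

```python
import math


def gamma_eigen(n, eigenvalues):
    """Assumed complexity term (illustration only):
    (1/sqrt(n)) * min_d ( d/sqrt(n) + sqrt(sum_{j>d} lambda_j) )."""
    num = len(eigenvalues)
    # tails[d] holds the sum of the eigenvalues with (1-based) index > d.
    tails = [0.0] * (num + 1)
    for d in range(num - 1, -1, -1):
        tails[d] = tails[d + 1] + eigenvalues[d]
    best = min(d / math.sqrt(n) + math.sqrt(tails[d]) for d in range(num + 1))
    return best / math.sqrt(n)


def empirical_rate(s, n_small=10**4, n_large=10**6, num_eigs=50_000):
    """Log-log slope of gamma(n) between two sample sizes for lambda_j = j^(-2s).
    The spectrum is truncated at num_eigs terms, which is an approximation."""
    eigs = [j ** (-2.0 * s) for j in range(1, num_eigs + 1)]
    g_small = gamma_eigen(n_small, eigs)
    g_large = gamma_eigen(n_large, eigs)
    return math.log(g_large / g_small) / math.log(n_large / n_small)


if __name__ == "__main__":
    s = 1.0
    print("empirical slope:", empirical_rate(s))
    print("predicted slope:", -2 * s / (2 * s + 1))
```

The minimizing index $d$ balances the two terms, which is the usual trade-off behind rates of this type; under the assumed form the slope should come out close to $-2s/(2s+1)$.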

However, the fact that the two settings lead to a regularization of the
same order seems very specific to this case, depending, in particular, on the
properties of the Lebesgue measure and of Sobolev spaces. In a more general
situation, if we assume the eigenvalues to be known, and $\eta_1$ to be a fixed
constant, we expect the analysis in setting (S1) to give a tighter estimate for
the minimal regularization function than the analysis in setting (S2); that is
to say, the function $\gamma(n)$ appearing in (S1) will be of smaller order than the
one appearing in (S2). Informally speaking, this is because the eigenvalues
of $L_k$ are related to the covering entropy of the unit ball of $\mathcal{H}_k$ in $L_2$ norm,
while setting (S2) considers covering entropy with respect to the stronger
supremum norm.

On the other hand, this tighter analysis comes at a certain price, namely, 
additional assumption (A3) and the requirement that the eigenvalues are 
known (or estimated), as already pointed out above. One advantage of supre- 
mum norm entropy is that, by definition, it is distribution independent. 
Furthermore, some relatively general results are known on this entropy de- 
pending on the regularity properties of the kernel function; see [41]. 

4. A model selection theorem and its application. 

4.1. An abstract model selection theorem. The remainder of the paper
is devoted to the proof of Theorem 3.1. However, in the present section we
change gears somewhat, deliberately setting aside the specific setting of
the SVM to present an "abstract" theorem resulting in oracle inequalities
that can be obtained for model selection by penalized empirical loss
minimization. This theorem is the cornerstone of the proof of Theorem 3.1.

Our motivation for momentarily leaving the SVM framework for a more
general one is twofold. On the one hand, we hope that it will make the
general principle underlying our result clearer to the reader, independently
of the specifics of the SVM (to which we will return in the next
section). On the other hand, we think that this result is general enough to
be of interest in itself, inasmuch as it can be applied in a variety of different
frameworks.

The theorem is mainly an extension of Theorem 4.2 of [22] to a
more general setting, namely, one where some key parameters, considered fixed
in the above reference, can now depend on the model. This extension is
necessary for our intended application to SVMs, which is presented in Section
4.2, and requires appropriate handling. However, the scope of this abstract
model selection theorem covers a wider variety of situations. Examples are
the classical VC-dimension setting using classification loss (in this case the
result of [22] is actually sufficient; see also the more detailed study [23]), or
regularized Boosting-type procedures (see [9], where an earlier version of the
model selection theorem presented here was used). The fact that the theorem
applies to approximate, rather than exact, penalized minimum empirical loss
estimation is a minor refinement that is useful in certain situations: this will
be the case for our application to SVMs, where the continuous regularization
scheme will be related to an approximate discrete penalization scheme.

We first need to introduce the following definition: 

Definition 4.1. A function $\psi: [0, \infty) \to [0, \infty)$ is sub-root if it is
nonnegative, nondecreasing, and if $r \mapsto \psi(r)/\sqrt{r}$ is nonincreasing for $r > 0$.

Sub-root functions have the following property:

Lemma 4.2 ([3]). Let $\psi: [0, \infty) \to [0, \infty)$ be a sub-root function. Then
it is continuous on $[0, \infty)$ and the equation $\psi(r) = r$ has a unique positive
solution. If we denote this solution by $r^*$, then for all $r > 0$, $r \ge \psi(r)$ if and
only if $r^* \le r$.
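
Lemma 4.2 suggests a simple way to approximate the fixed point numerically: since $r \ge \psi(r)$ exactly when $r \ge r^*$, one can bisect on the sign of $r - \psi(r)$. The following is a minimal sketch under that observation; the function name `sub_root_fixed_point`, the tolerance and the iteration cap are illustrative choices, and the input is assumed to be a nontrivial sub-root function.

```python
import math


def sub_root_fixed_point(psi, tol=1e-12, max_iter=200):
    """Approximate the unique positive solution r* of psi(r) = r by bisection,
    using the characterization: r >= psi(r) if and only if r >= r*."""
    hi = 1.0
    for _ in range(max_iter):      # grow hi until hi >= psi(hi), i.e. hi >= r*
        if hi >= psi(hi):
            break
        hi *= 2.0
    lo = hi
    for _ in range(max_iter):      # shrink lo until lo < psi(lo), i.e. lo < r*
        if lo < psi(lo):
            break
        lo /= 2.0
    for _ in range(max_iter):      # invariant: lo < r* <= hi
        mid = 0.5 * (lo + hi)
        if mid >= psi(mid):
            hi = mid
        else:
            lo = mid
        if hi - lo <= tol * max(1.0, hi):
            break
    return hi


if __name__ == "__main__":
    # psi(r) = 3 * sqrt(r) is sub-root; its fixed point is r* = 9.
    print(sub_root_fixed_point(lambda r: 3.0 * math.sqrt(r)))
```

For $\psi(r) = c\sqrt{r}$ the fixed point is $c^2$, which gives an easy sanity check of the routine.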

We can now state the model selection result: 

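Theorem 4.3. Let $\ell: \mathfrak{G} \to L_2(P)$ [where $\mathfrak{G} \subset L_2(P)$] be a loss function
and assume that there exists $g^* \in \mathrm{ArgMin}_{g \in \mathfrak{G}} E[\ell(g)]$. Let $(\mathcal{G}_m)_{m \in \mathcal{M}}$, $\mathcal{G}_m \subset \mathfrak{G}$,
be a countable collection of classes of functions and assume there exists the
following:

• a pseudo-distance $d$ on $\mathfrak{G}$;

• a sequence of sub-root functions $(\phi_m)$, $m \in \mathcal{M}$;

• two positive sequences $(b_m)$ and $(C_m)$, $m \in \mathcal{M}$;

such that

(H1) $\forall m \in \mathcal{M}, \forall g \in \mathcal{G}_m$: $\|\ell(g)\|_\infty \le b_m$;

(H2) $\forall g, g' \in \mathfrak{G}$: $\mathrm{Var}(\ell(g) - \ell(g')) \le d^2(g, g')$;

(H3) $\forall m \in \mathcal{M}, \forall g \in \mathcal{G}_m$: $d^2(g, g^*) \le C_m L(g, g^*)$;

and, if $r_m^*$ denotes the solution of $\phi_m(r) = r/C_m$,

(H4) $\forall m \in \mathcal{M}, \forall g_0 \in \mathcal{G}_m, \forall r \ge r_m^*$:
$E\Big[\sup_{g \in \mathcal{G}_m,\ d^2(g, g_0) \le r} (P - P_n)(\ell(g) - \ell(g_0))\Big] \le \phi_m(r)$.

Let $(x_m)_{m \in \mathcal{M}}$ be a sequence of real numbers such that $\sum_{m \in \mathcal{M}} e^{-x_m} \le 1$.
We assume that families $(b_m)$, $(C_m)$, $(x_m)$, $m \in \mathcal{M}$, are ordered the same
way, by which we mean that

(4.1) $\forall m, m' \in \mathcal{M}$: $x_m \le x_{m'} \Rightarrow b_m \le b_{m'}$ and $C_m \le C_{m'}$.

Let $\xi > 0$, $K > 1$ be some real numbers to be fixed in advance. Put $B_m =
75KC_m + 28b_m$, and let $\mathrm{pen}(m)$ be a penalty function such that, for each
$m \in \mathcal{M}$,

(4.2) $\mathrm{pen}(m) \ge 250K\frac{r_m^*}{C_m} + \frac{B_m(x_m + \xi + \log(2))}{3n}$.

Let $(\rho_m)_{m \in \mathcal{M}}$ be a family of positive numbers and $\hat{g}$ denote a $(\rho_m)$-approximate
penalized minimum empirical loss estimator over the family $(\mathcal{G}_m)$ using the
above penalty function, that is, satisfying

(4.3) $\exists \hat{m} \in \mathcal{M}$: $\hat{g} \in \mathcal{G}_{\hat{m}}$ and
$P_n\ell(\hat{g}) + \mathrm{pen}(\hat{m}) \le \inf_{m \in \mathcal{M}} \inf_{g \in \mathcal{G}_m} (P_n\ell(g) + \mathrm{pen}(m) + \rho_m)$;

then the following deviation inequality holds with probability greater than
$1 - \exp(-\xi)$:

$L(\hat{g}, g^*) \le \frac{K + 1/5}{K - 1} \inf_{m \in \mathcal{M}} \Big(\inf_{g \in \mathcal{G}_m} L(g, g^*) + 2\,\mathrm{pen}(m) + \rho_m\Big)$.
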
Furthermore, if the penalty function satisfies, for each $m \in \mathcal{M}$,

(4.4) $\mathrm{pen}(m) \ge 250K\frac{r_m^*}{C_m} + \frac{B_m(x_m + \log(2))}{3n} + \frac{B_m \log B_m}{n}$,

then the following expected risk inequality holds:

E[L(g,g*)]< K + 1/5 inf ( inf L(g,g*) + 2pen(m) + p m + - 
K — 1 meM \g&Q m n 

Remarks. 

1. Note that the difference with Theorem 4.2 of [22] is the fact that constants
$b_m$ and $C_m$ can depend on $m$, which requires additional work, but is a
necessary step for application to SVMs.

2. In hypothesis (H4), $\phi(r^2)$ can be interpreted as the modulus of continuity
with respect to $d$ of the supremum of the empirical process indexed by
$g$.

3. The class $\mathfrak{G} \subset L_2(P)$ should be seen as the "ambient space"; it should
at least contain all models. Note that the choice of $\mathfrak{G}$ determines the target
function $g^*$ (the minimizer of the average loss on $\mathfrak{G}$). Typically, the theorem
will be applied with $\mathfrak{G} = L_2(P)$ or $\mathfrak{G} = L_2(P_X)$ (as will be the case
below), but other choices may be useful.

4. Although it is not its main purpose, this theorem can also be used for the
convergence analysis of the empirical loss minimization procedure on a
single model $\mathcal{G}$. Namely, it is sufficient to consider a model family reduced
to a singleton and to disregard the penalty. This is also a situation where
the choice of $\mathfrak{G}$ can be of interest. If we make the choice $\mathfrak{G} = \mathcal{G}$, then
the target $g^*_{\mathcal{G}}$ is the best available function in the model $\mathcal{G}$. In this case,
the bias term of the bound vanishes. By adding to the left and right of
the obtained inequality the quantity $L(g^*_{\mathcal{G}}, g^*)$, where $g^*$ is the minimum
average loss function over a larger class [e.g., $L_2(P)$], it is then possible
to obtain a constant 1 in front of the bias term (instead of $\frac{K + 1/5}{K - 1} > 1$).
However, this does not come completely for free since we must consider
$g^*_{\mathcal{G}}$ instead of $g^*$ when checking assumption (H3). This assumption
may actually be harder to check in practice, because usually $g^*$ has
a simple, closed form (e.g., the Bayes classifier in a classification framework),
whereas $g^*_{\mathcal{G}}$ depends on the approximation properties of model $\mathcal{G}$.
Under certain convexity assumptions on the risk and on the model, it was
shown in [4] that (H3) holds in this setting; this way we retrieve a bound
similar in all respects to the single-model ERM results of [4].

4.2. Application to support vector machines. We now briefly present the
key elements needed to apply Theorem 4.3 to the SVM framework. Remember
that in the case of SVMs, the natural loss function to consider is the
hinge loss function $\ell(g) = (x, y) \mapsto (1 - yg(x))_+$: this is the empirical loss
which is minimized (subject to regularization) to find a classifier $\hat{g}$. Interpreting
the SVM procedure as a penalized model selection procedure (see
Section 2.3), we intend to apply Theorem 4.3.
To this end, we first discretize the continuous family of models $(B(R))_{R \in \mathbb{R}_+}$
over a certain family of values of the radii: thus, our collection of models
will be $(B(R))_{R \in \mathcal{R}}$, where $\mathcal{R}$ is an appropriate discrete set of positive real
numbers. We now have to check assumptions (H1)-(H4) of Theorem 4.3.
The detailed analysis is given in Section 6.3 and the following statement
sums up the obtained results:
Theorem 4.4. Let $\mathcal{R}$ be a countable set of positive real numbers, $\mathfrak{G} =
L_2(P_X)$, and $\ell$ the hinge loss function.

In setting (S1) under assumptions (A1), (A2) and (A3), the family of
models $(B(R))_{R \in \mathcal{R}}$ satisfies hypotheses (H1) to (H4) of Theorem 4.3 with
the following parameter values:

(MR 1 

b R = l + MR- C R = 2[ + — 



X 



Vi Vo 



Cl. J d , 771 



" .AT ^cn \ . A7 M \ ^-^ ■> i 



sph den \y/n M 



3>d 



In setting (S2) under assumptions (A1) and (A2), the family of models
$(B(R))_{R \in \mathcal{R}}$ satisfies hypotheses (H1) to (H4) of Theorem 4.3 with the
following parameter values:

b R = l + MR; C r =(mR + —\- r R <2hmM- 2 C R xl{n), 

V Vo; 

where x* is as in the definition of setting (S2). 

Once assumptions (H1)-(H4) are granted, the remaining task in order to
prove Theorem 3.1 is to formalize precisely how to go back and forth between
the continuous regularization and the discrete set of models $(B(R))_{R \in \mathcal{R}}$.
The details are given in Section 6.4.

5. Discussion and conclusion. 

5.1. Relation to other work. In this section we compare our result to
earlier work. The properties of the generalization error of the SVM algorithm
have been investigated in various ways (we omit here the vast literature
on algorithmic aspects of the SVM, with which the present paper is not
concerned). In this regard, we distinguish between two types of results. The
first type consists of error bounds, which bound the difference between the
empirical and true expected loss of an estimator. The second type consists of
excess loss inequalities, which relate the risk of the estimator to the Bayes risk.

5.1.1. Error bounds. The first result about the SVM algorithm is due
to Vapnik, who proved that the fat-shattering dimension (see, e.g., [1] for a
definition) at scale 1 of the set $\{(x, y) \mapsto y\langle k(x, \cdot), f\rangle_k + b = yf(x) + b : f \in
\mathcal{H}_k, \|f\|_k \le R, b \in \mathbb{R}\}$ on a sample $X_1, \dots, X_n$ is bounded by $D^2R^2$, where $D$
is the radius of the smallest ball enclosing the sample in feature space, which
can be computed as $D = \inf_{g \in \mathcal{H}_k} \max_{i = 1, \dots, n} \|k(X_i, \cdot) - g\|_k$ or, equivalently,

$D^2 := \max_{\beta \ge 0,\ \sum_{i=1}^n \beta_i = 1} \sum_{i=1}^n \beta_i k(X_i, X_i) - \sum_{i,j=1}^n \beta_i \beta_j k(X_i, X_j)$.

This bound is known as the "radius-margin" bound since it involves the 
ratio of the radius of the sphere enclosing the data in feature space and of 
the (geometrical) margin of separation of the data which is equal to 1/R 
when the scaling is chosen such that the points lying on the margin (the 
"support vectors") have output value in {—1, 1}. 

The first formal error bounds on large margin classifiers were proven by
Bartlett [2]. In these bounds, the misclassification error $\theta(f)$ of a real-valued
classifier $f$ is compared to the fraction of the sample points which are misclassified
or almost misclassified, that is, which have margin less than a certain (positive)
value. In later work, it was noticed that for classes of functions such
as $B(R)$, the spectrum of the kernel operator [27] plays an important role in
capacity analysis.
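
The dual formula for $D^2$ given earlier in this subsection is a concave maximization over the probability simplex, so it can be approximated from the Gram matrix alone. The sketch below uses a plain Frank-Wolfe iteration with the standard step size $2/(t+2)$; this is one possible numerical scheme chosen here for illustration, not a method taken from the references, and the function name `enclosing_ball_radius_sq` is hypothetical.

```python
def enclosing_ball_radius_sq(gram, num_iter=2000):
    """Frank-Wolfe sketch for
    D^2 = max over the simplex of sum_i beta_i K_ii - sum_ij beta_i beta_j K_ij,
    where gram[i][j] = k(X_i, X_j) is a symmetric positive semidefinite matrix."""
    n = len(gram)
    beta = [1.0 / n] * n                      # start from the uniform weights

    def k_times(b):
        return [sum(gram[i][j] * b[j] for j in range(n)) for i in range(n)]

    for t in range(num_iter):
        k_beta = k_times(beta)
        # Gradient of the concave objective with respect to beta_i.
        grad = [gram[i][i] - 2.0 * k_beta[i] for i in range(n)]
        best = max(range(n), key=lambda i: grad[i])
        step = 2.0 / (t + 2.0)
        # Move toward the simplex vertex with the largest gradient coordinate.
        beta = [(1.0 - step) * b for b in beta]
        beta[best] += step

    k_beta = k_times(beta)
    linear = sum(beta[i] * gram[i][i] for i in range(n))
    quadratic = sum(beta[i] * k_beta[i] for i in range(n))
    return linear - quadratic


if __name__ == "__main__":
    # Two points 0 and 2 on the real line with the linear kernel k(x, y) = x * y:
    # the smallest enclosing ball has radius 1.
    print(enclosing_ball_radius_sq([[0.0, 0.0], [0.0, 4.0]]))
```

Every iterate is feasible, so the returned value is always a lower bound on $D^2$ that improves with the number of iterations.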

More recent bounds on the capacity of such classes, involving Rademacher 
averages, have confirmed this role. We reproduce here a particularly elegant 
bound based on this technique (Theorem 21 of [5], slightly adapted for our 
notation): 

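Theorem 5.1. Let $R > 0$; for any $x > 0$, with probability at least $1 -
4e^{-x}$, for all $f \in B(R)$,

$P\theta(f) \le P_n[\ell(f) \wedge 1] + \frac{4R}{n}\sqrt{\sum_{i=1}^n k(X_i, X_i)} + 9\sqrt{\frac{x}{2n}}$.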



Error bounds such as the one above are typically valid for any function in $B(R)$
uniformly. They thus do not take into account the specificity of the SVM
algorithm. Also, for an error bound, we cannot expect a better convergence
rate than $n^{-1/2}$ of the empirical loss to the true average loss, since for a single
function this is the rate given asymptotically by the central limit theorem.

The term $\sum_{i=1}^n k(X_i, X_i)$ in the above theorem is the trace of the so-called
Gram matrix (matrix of inner products of the data points in feature
space). Its expected value under the sampling of the data is precisely $n$ times
the trace of the kernel operator, that is, the sum of its eigenvalues. If we
compare this to our main result Theorem 3.1, in setting (S1), we see that
our complexity penalty is always of smaller order (up to a constant factor,
and to the relation between empirical and true spectrum, which we do not
cover here, but which is studied, e.g., in [8, 28]).

In a different direction, error bounds for regularization algorithms which
explicitly involve the regularization parameter are presented in [11].

5.1.2. Excess loss inequalities. Studying the behavior of relative (or excess)
loss has been at the heart of recent work in the statistical learning
field. Some results have been developed specifically for regularization algorithms
of the type (2.2). In particular, asymptotic results on the consistency
of the SVM algorithm, that is, convergence of the risk toward the Bayes risk,
were obtained by Steinwart in [31].

Using a leave-one-out analysis of the SVM algorithm and techniques similar
to those in [11], Zhang [40] obtained sharp bounds on the difference
between the risk of the SVM classifier and the Bayes risk of the form

$E[\ell(f_n)] - cE[\ell(f^*)]$,

where $c > 1$. However, because of this strict inequality,
one cannot directly obtain information about the convergence of $L(f_n, f^*)$ to
zero from these results as soon as $E[\ell(f^*)]$ is nonzero.

Studying the convergence of $L(f_n, f^*)$ opens the door to complexity penalties
that decrease faster than $n^{-1/2}$, because the final goal is to compare
directly the true average loss of the target and the estimated function, not
their empirical loss. The so-called "localized approach" (that we followed in
this paper) is a theoretical device used to prove such improved rates. Introduced
in the statistical community for the general study of M-estimation, it
has become widespread recently in the learning theoretical community; see,
for example, [3, 4, 6, 16, 17, 20, 23, 25].

Concerning more specifically the SVM, recent works have concentrated
on obtaining faster rates of convergence in various senses. In [12], the $q$-soft
margin SVM is studied (i.e., when the considered loss function is $\ell^q$) for
$q > 1$. In [26], the SVM is studied from the point of view of inverse problems.
In [32], convergence properties of the standard SVM are studied in the case
of the Gaussian kernel. In the above references, to obtain the best bounds
on the rates of convergence, the regularization parameter $\Lambda_n$ (and, in the
latter reference, the width of the Gaussian kernel) must have a prescribed
decrease as a function of the sample size $n$, depending on a priori knowledge
of regularity properties of the function $f^*$ (or $\eta$). Therefore, these results
do not display adaptivity with respect to the regularity of $f^*$.

In the recent paper [33], a general inequality for regularized risk minimizers
was derived, applying, in particular, to the SVM framework. The main
differences in this work with respect to our framework are the following:

• a general family of possible loss functions (which includes hinge loss and
square loss) is considered;

• a general condition on the loss and the generating probability distribution
is considered, covering, in particular, the general Tsybakov's noise setting
for classification (but without adaptivity in this regard);

• the regularization considered is fixed to be the squared RKHS norm;

• the capacity of the kernel space is measured in terms of universal $L_2$
entropy.

While our work is obviously less general concerning the first two points,
our results are sharper concerning the last two. One of our main goals
here was to study precisely what the minimal order of the penalty is
with which we could prove an oracle inequality for the loss function used
in the SVM. Furthermore, our setting (S2) relies on a capacity measure of
the kernel space based on the spectral properties of the associated integral
operator, which is sharper than universal entropy in this setting. Again, the
approach we followed here was inspired by an analogy of the SVM with the
more classical regularized least squares regression, which is by now relatively
well understood, and where the optimal results concerning the last two
points above are known to be sharper than those obtained in [33]. Our investigation
was driven by the question of how much of these precise results could be
carried over to the SVM setting.

Finally, while our results demonstrate the adaptivity of support vector
machines with respect to the approximation properties by the RKHS $\mathcal{H}_k$ of
the target $f^*$, we do not tackle the question of full adaptivity with respect
to Tsybakov's noise condition. Only recently have results been obtained in
this direction [18, 34, 36].

5.2. Conclusion. Summing up our findings, we have brought forth a general
theorem allowing one to derive oracle inequalities for penalized model
selection methods. Application of this theorem to support vector machines
has led to precise sufficient conditions on the form of the regularization
function to be used in order to obtain oracle inequalities for the hinge loss.
In particular, under the assumptions considered here about the probability
distribution $P(Y|X)$, the bound we obtain gets better if we use a regularizer
that is linear in the Hilbert norm rather than the standard quadratic one.

This result thus brings forth the interesting question of whether an SVM-type
algorithm using a lighter (linear in the Hilbert norm) regularizer would
yield improved practical results. Several issues are in play here. First, a
practical issue: a disadvantage of a linear regularizer is that the associated
optimization problem, although convex, is not as easily tractable from
an algorithmic point of view as the squared-norm regularization. Second, a
theoretical issue, namely, whether a corresponding lower bound holds, which
would prove that the linear regularizer is indeed better. This is the case for
regularized least squares with Gaussian noise; for the SVM, lower bounds
remain very largely an open problem. And third, a crucial issue, both theoretical
and practical, and not tackled here, is that the multiplicative factor
$\Lambda_n$ in (2.2) is seldom taken equal to some a priori fixed function of $n$ in
practice. Instead, it is typically picked by cross-validation. It is important
to bring into focus the fact that, even if the quadratic regularizer were suboptimal
for a fixed penalty scheme, this may still be compensated by the
cross-validation step for the multiplicative factor $\Lambda_n$, which could implicitly
"correct" this effect. We believe this issue has not been studied in current
work on SVMs, and that it is a central point to be studied in the future in
order to reconcile theory and practice.

Several other mathematical problems remain open. Ideally, one would
hope to obtain the same kind of result for the full SVM algorithm instead
of the SVM$_0$ considered here. We mentioned in our comments after the
main theorem a possible extension from our "gap" condition to a general
Tsybakov's noise condition. This would give rise to an additional term for
which we cannot always ensure that it is only a negligible remainder as the
sample size grows to infinity. Therefore, the question of full adaptivity to
Tsybakov's noise remains generally open. Finally, it is not clear whether our
sufficient minimum rate conditions for the penalty are minimal: it would be
interesting to investigate whether a lower order penalty would, for example,
yield an inconsistent estimator.

6. Proofs. 

6.1. Proof of Lemma 2.1. We start by proving (i). We can write

$E[\ell(g)] = E[\eta(X)(1 - g(X))_+ + (1 - \eta(X))(1 + g(X))_+]$.

We will prove that, for each fixed $x$, $s^*(x)$ minimizes the expression in the
expectation. Let us study the function $g \mapsto \eta(1 - g)_+ + (1 - \eta)(1 + g)_+$. It is
easy to see that for $\eta \in [\frac{1}{2}, 1]$ it is minimized for $g = 1$, and for $\eta \in [0, \frac{1}{2}]$ it is
minimized for $g = -1$. This means that, in all cases, the minimum is reached
at $g = s^*$. Finally, it is easy to see that this minimum is unique whenever
$\eta \notin \{0, \frac{1}{2}, 1\}$; hence, $f^* = s^*$ a.s. on this set. (Notice additionally that, for
$\eta = 1$, any $g \ge 1$ reaches the minimum, for $\eta = 0$, any $g \le -1$ reaches the
minimum and for $\eta = \frac{1}{2}$, any $g \in [-1, 1]$ reaches the minimum.)
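
The pointwise minimization in (i) is easy to check numerically. The sketch below evaluates the conditional hinge risk $g \mapsto \eta(1 - g)_+ + (1 - \eta)(1 + g)_+$ on a grid and returns its minimizer; the grid range and resolution, and the function names, are arbitrary illustrative choices.

```python
def pointwise_hinge_risk(eta, g):
    """Conditional hinge risk at a point where P(Y = 1 | X = x) = eta."""
    return eta * max(0.0, 1.0 - g) + (1.0 - eta) * max(0.0, 1.0 + g)


def grid_minimizer(eta, half_width=3, steps_per_unit=100):
    """Minimizer of the conditional hinge risk over a regular grid of g values."""
    grid = [i / steps_per_unit
            for i in range(-half_width * steps_per_unit,
                           half_width * steps_per_unit + 1)]
    return min(grid, key=lambda g: pointwise_hinge_risk(eta, g))


if __name__ == "__main__":
    for eta in (0.1, 0.3, 0.7, 0.9):
        g_star = grid_minimizer(eta)
        print(eta, g_star, pointwise_hinge_risk(eta, g_star))
```

For $\eta \ne \frac{1}{2}$ the grid minimizer is the sign of $2\eta - 1$, and the minimal value is $2\min(\eta, 1 - \eta)$, in line with the argument above.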

We now turn to (ii). Considering (i), we can arbitrarily choose $f^* = s^*$.
We then have to prove that

$E[1\{Yg(X) \le 0\} - 1\{Ys^*(X) \le 0\}] \le E[(1 - Yg(X))_+ - (1 - Ys^*(X))_+]$.

We know that the right-hand side is nonnegative. Moreover, the random
variable in the left-hand side is positive (and thus equal to 1) if and only
if $Yg(X) \le 0$ and $Ys^*(X) > 0$, in which case $(1 - Yg(X))_+ \ge 1$ and
$(1 - Ys^*(X))_+ = 0$ (since $s^*$ takes its values in $\{-1, 1\}$). This proves the
inequality.

6.2. Proof of Theorem 4.3. To prove Theorem 4.3, we first state the 
key technical result concerning a localized uniform control of an empirical 
process. 

Theorem 6.1. Let $\mathcal{F}$ be a class of measurable, square integrable functions
such that for all $f \in \mathcal{F}$, $Pf - f \le b$. Let $w(f)$ be a nonnegative function
such that $\mathrm{Var}[f] \le w(f)$. Let $\phi$ be a sub-root function, $D$ be some positive
constant and $r^*$ be the unique positive solution of $\phi(r) = r/D$. Assume that
the following holds:

(6.1) $\forall r \ge r^*$: $E\Big[\sup_{f \in \mathcal{F}: w(f) \le r} ((P - P_n)f)_+\Big] \le \phi(r)$.

Then, for all $x > 0$ and all $K > D/7$, the following inequality holds with
probability at least $1 - e^{-x}$:

$\forall f \in \mathcal{F}$: $Pf - P_nf \le K^{-1}w(f) + \frac{50K}{D^2}r^* + \frac{(K + 9b)x}{n}$.

If additionally, the convex hull of $\mathcal{F}$ contains the null function, the same is
true when the positive part in (6.1) is removed.

Note that this result is very similar to Theorem 3.3 in [3] which was 
obtained using techniques from [21]. We use similar techniques to obtain 
the version presented here. 

We will need to transform assumption (6.1), using the following technical 
lemma which is a form of the so-called "peeling device" ; the version presented 
here is very close to a similar lemma in [22]. 



Lemma 6.2. If $\phi$ is a sub-root function such that for any $r \ge r^* > 0$,

$E\Big[\sup_{f \in \mathcal{F}: w(f) \le r} (Pf - P_nf)_+\Big] \le \phi(r)$,

one has for any $r \ge r^*$,

$E\Big[\sup_{f \in \mathcal{F}} \frac{Pf - P_nf}{w(f) + r}\Big] \le 4\frac{\phi(r)}{r}$,

and when $0 \in \mathrm{conv}\,\mathcal{F}$, the same is true if the positive part is removed in the
previous condition.

Proof. We choose some $x > 1$. In the calculations below a supremum
over an empty set is considered as 0. We have

$\sup_{f \in \mathcal{F}} \frac{Pf - P_nf}{w(f) + r}$

$\le \sup_{f \in \mathcal{F}: w(f) \le r} \frac{(Pf - P_nf)_+}{w(f) + r} + \sum_{k \ge 0} \sup_{f \in \mathcal{F}: rx^k \le w(f) \le rx^{k+1}} \frac{(Pf - P_nf)_+}{w(f) + r}$

$\le \frac{1}{r} \sup_{f \in \mathcal{F}: w(f) \le r} (Pf - P_nf)_+ + \sum_{k \ge 0} \sup_{f \in \mathcal{F}: rx^k \le w(f) \le rx^{k+1}} \frac{(Pf - P_nf)_+}{rx^k + r}$

$\le \frac{1}{r} \Big( \sup_{f \in \mathcal{F}: w(f) \le r} (Pf - P_nf)_+ + \sum_{k \ge 0} \frac{1}{1 + x^k} \sup_{f \in \mathcal{F}: w(f) \le rx^{k+1}} (Pf - P_nf)_+ \Big)$.

In the general case, note that $\sup_{a \in A}(0 \vee a) = 0 \vee \sup_{a \in A} a$. In the case where
$\mathrm{conv}\,\mathcal{F}$ contains the null function, one has $\sup_{f \in \mathcal{F}} (Pf - P_nf) = \sup_{f \in \mathrm{conv}\,\mathcal{F}} (Pf -
P_nf) \ge 0$, so that $\sup_{f \in \mathcal{F}} (Pf - P_nf)_+ = \sup_{f \in \mathcal{F}} (Pf - P_nf)$, which allows us
to remove the positive part in the condition for $\phi$.
So, taking the expectation, we obtain

$E\Big[\sup_{f \in \mathcal{F}} \frac{Pf - P_nf}{w(f) + r}\Big] \le \frac{1}{r} \Big( \phi(r) + \sum_{k \ge 0} \frac{\phi(rx^{k+1})}{1 + x^k} \Big)$

$\le \frac{\phi(r)}{r} \Big( 1 + \sum_{k \ge 0} \frac{x^{(k+1)/2}}{1 + x^k} \Big)$

$\le \frac{\phi(r)}{r} \Big( 1 + x^{1/2} \Big( \frac{1}{2} + \sum_{k \ge 1} x^{-k/2} \Big) \Big)$

$\le \frac{\phi(r)}{r} \Big( 1 + x^{1/2} \Big( \frac{1}{2} + \frac{1}{x^{1/2} - 1} \Big) \Big)$,

where we have used the sub-root property for the second inequality. It is
then easy to check that the minimum of the right-hand side is attained at

$x = (1 + \sqrt{2})^2$.

Plugging this value in the right-hand side, we obtain the result. □
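
The final optimization step can be checked numerically. Assuming the last bound in the proof has the form $\frac{\phi(r)}{r}\big(1 + x^{1/2}(\frac{1}{2} + \frac{1}{x^{1/2} - 1})\big)$ (a reading of the displayed chain, stated here as an assumption), the sketch below minimizes the multiplicative factor over a grid of $x > 1$ and compares the result with $x = (1 + \sqrt{2})^2$. The function name `peeling_factor` and the grid are illustrative.

```python
import math


def peeling_factor(x):
    """Factor multiplying phi(r)/r in the peeling bound, as a function of x > 1
    (assumed form: 1 + sqrt(x) * (1/2 + 1/(sqrt(x) - 1)))."""
    u = math.sqrt(x)
    return 1.0 + u * (0.5 + 1.0 / (u - 1.0))


# Grid search over x in (1, 21] with step 0.001.
grid = [1.0 + i / 1000 for i in range(1, 20001)]
best_x = min(grid, key=peeling_factor)

if __name__ == "__main__":
    x_star = (1.0 + math.sqrt(2.0)) ** 2
    print("grid minimizer:", best_x, "claimed minimizer:", x_star)
    print("minimal factor:", peeling_factor(x_star))
```

With $u = \sqrt{x}$ the factor is $1 + u/2 + u/(u - 1)$, whose derivative vanishes at $u = 1 + \sqrt{2}$; the minimal value is $\frac{5}{2} + \sqrt{2}$, which is below the constant 4 appearing in the lemma.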

Proof of Theorem 6.1. The main technical tool of the proof is Talagrand's
concentration inequality (here we use an improved version proved
in [10]). We recall it briefly as follows.

Let $X_i$ be independent variables distributed according to $P$, and $\mathcal{F}$ a set
of functions from $\mathcal{X}$ to $\mathbb{R}$ such that $E[f] = 0$, $\|f\|_\infty \le c$ and $\mathrm{Var}[f] \le \sigma^2$ for
any $f \in \mathcal{F}$. Let $Z = \sup_{f \in \mathcal{F}} \sum_{i=1}^n f(X_i)$. Then with probability $1 - e^{-x}$, it
holds that

(6.2) $Z \le E[Z] + \sqrt{2x(n\sigma^2 + 2cE[Z])} + \frac{cx}{3}$.

We will apply this inequality to the rescaled set of functions

$\mathcal{F}_r = \Big\{ \frac{Pf - f}{w(f) + r},\ f \in \mathcal{F} \Big\}$,

where we assume $r \ge r^*$. The precise choice for $r$ will be decided later. We
now check the assumptions on the supremum norm and the variance of
functions in $\mathcal{F}_r$. We have

$\sup_{f \in \mathcal{F}} \sup_{x \in \mathcal{X}} \frac{Pf - f(x)}{r + w(f)} \le \frac{b}{r}$;

and, recalling the hypothesis that $\mathrm{Var}[f] \le w(f)$, the following holds:

$\mathrm{Var}\Big[\frac{f(X)}{w(f) + r}\Big] \le \frac{w(f)}{(w(f) + r)^2} \le \frac{w(f)}{4rw(f)} = \frac{1}{4r}$,

where we have used the fact that $2ab \le a^2 + b^2$. Introducing the following
random variable

(6.3) $V_r = \sup_{f \in \mathcal{F}} \frac{Pf - P_nf}{w(f) + r}$,

we thus obtain by application of (6.2) that, with probability at least $1 - e^{-x}$,

(6.4) $V_r \le E[V_r] + \sqrt{\frac{x}{2rn} + \frac{4xbE[V_r]}{rn}} + \frac{xb}{3rn}$.

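It follows from Lemma 6.2 that $E[V_r] \le 4\phi(r)/r$. Plugging this into (6.4),
and recalling that $r^*$ is the unique solution of $\phi(r) = r/D$, we obtain that,
for all $x > 0$ and $r \ge r^*$, the following inequality holds with probability at
least $1 - e^{-x}$:

(6.5) $\forall f \in \mathcal{F}$: $\frac{Pf - P_nf}{w(f) + r} \le \inf_{\alpha > 0} \Big( 4(1 + \alpha)\frac{\sqrt{r^*}}{D\sqrt{r}} + \sqrt{\frac{x}{2nr}} + \Big(\frac{1}{3} + \frac{1}{\alpha}\Big)\frac{bx}{rn} \Big)$.

Here, we have used the fact that, for $r \ge r^*$, $\phi(r)/r \le \sqrt{r^*/(rD^2)}$ and that
$2\sqrt{ab} \le \alpha a + b/\alpha$.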

Now given some constant $K$, we want to find $r\ge r^*$ such that $V_r\le1/K$
(with high probability). This corresponds to finding $r$ such that the right-hand
side of (6.5) is upper bounded by $1/K$.
Denote $A_1=4(1+\alpha)\sqrt{r^*}/D+\sqrt{x/2n}$ and $A_2=(1/3+1/\alpha)bx/n$. Then
we have to find $r$ such that $A_1r^{-1/2}+A_2r^{-1}\le K^{-1}$. It can be easily checked
that this is satisfied if

\[
r\ge K^2A_1^2+2A_2K. \tag{6.6}
\]

We have

\[
K^2A_1^2+2A_2K\le32(1+\alpha)^2\frac{K^2r^*}{D^2}+\frac xn\bigl(K^2+2bK/3+2bK/\alpha\bigr).
\]

Taking $\alpha=1/4$, we conclude that (6.6) is satisfied when the following holds:

\[
r\ge50\frac{K^2}{D^2}r^*+(K^2+9bK)\frac xn.
\]

Note that $K>D/7$ ensures that the lower bound above is greater than $r^*$.
We can thus take $r$ equal to this value.
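
The choice of $r$ can be sanity-checked numerically. The sketch below assumes $A_1=4(1+\alpha)\sqrt{r^*}/D+\sqrt{x/2n}$ and $A_2=(1/3+1/\alpha)bx/n$ with $\alpha=1/4$, and verifies on random parameter draws (with $K>D/7$) that $r=50K^2r^*/D^2+(K^2+9bK)x/n$ gives $A_1r^{-1/2}+A_2r^{-1}\le1/K$. The function names and sampling ranges are illustrative assumptions.

```python
import math
import random

ALPHA = 0.25  # value of alpha chosen in the proof

def radius(K, D, r_star, b, x, n):
    """Candidate radius r = 50 K^2 r*/D^2 + (K^2 + 9 b K) x / n."""
    return 50.0 * K ** 2 * r_star / D ** 2 + (K ** 2 + 9.0 * b * K) * x / n

def rhs_of_6_5(r, D, r_star, b, x, n, alpha=ALPHA):
    """A_1 r^{-1/2} + A_2 r^{-1} for a fixed alpha."""
    a1 = 4.0 * (1.0 + alpha) * math.sqrt(r_star) / D + math.sqrt(x / (2.0 * n))
    a2 = (1.0 / 3.0 + 1.0 / alpha) * b * x / n
    return a1 / math.sqrt(r) + a2 / r

rng = random.Random(0)
worst = 0.0  # largest observed value of K * (A_1 r^{-1/2} + A_2 r^{-1})
for _ in range(2000):
    D = rng.uniform(0.1, 10.0)
    K = (D / 7.0) * rng.uniform(1.01, 20.0)  # enforces K > D/7
    r_star = rng.uniform(1e-4, 1.0)
    b = rng.uniform(0.1, 5.0)
    x = rng.uniform(0.1, 10.0)
    n = rng.randint(10, 10000)
    r = radius(K, D, r_star, b, x, n)
    worst = max(worst, K * rhs_of_6_5(r, D, r_star, b, x, n))
```

A value of `worst` not exceeding 1 is what the argument predicts; the check is a numerical illustration, not a proof.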

Combining the above results concludes the proof of Theorem 6.1. □ 

We are now in a position to proceed to the proof of the main model 
selection theorem. 

Proof of Theorem 4.3. The main use of hypotheses (H1), (H2) and
(H4) will be to apply Theorem 6.1 to the class

\[
\mathcal F_{m,g_0}=\{\ell(g)-\ell(g_0),\ g\in\mathcal G_m\}
\]

for some $m\in\mathcal M$, $g_0\in\mathcal G_m$ with the choice $w(f)=\min\{d^2(g,g_0)\mid g\in\mathcal G_m,\ \ell(g)-\ell(g_0)=f\}$,
so that, using hypotheses (H1), (H2), (H4) and the fact that the
null function belongs to the class, we obtain that, for any arbitrary $K>C_m/7$,
with probability at least $1-e^{-x}$,

\[
\forall g\in\mathcal G_m\qquad
(P-P_n)(\ell(g)-\ell(g_0))\le K^{-1}d^2(g,g_0)+50\frac{K}{C_m^2}r_m^*+\frac{(K+9b_m)x}{n}. \tag{6.7}
\]

For each $m\in\mathcal M$, we define $u_m$ and $g_m$ as functions in $\mathcal G_m$ satisfying,
respectively,

\[
d(u_m,g^*)=\inf_{g\in\mathcal G_m}d(g,g^*),\qquad
L(g_m,g^*)=\inf_{g\in\mathcal G_m}L(g,g^*).
\]

[If these infima are not attained, one can choose $u_m,g_m$ such that $d(u_m,g^*)$,
$L(g_m,g^*)$ are arbitrarily close to the inf, and use a dominated convergence
argument at the end of the proof.]
Now, for any $m\in\mathcal M$, $g_m\in\mathcal G_m$,

\[
\begin{aligned}
L(\hat g,g^*)-L(g_m,g^*)
&=P\ell(\hat g)-P\ell(g_m)\\
&=P_n\ell(\hat g)-P_n\ell(g_m)+(P-P_n)(\ell(\hat g)-\ell(g_m))\\
&\le\mathrm{pen}(m)-\mathrm{pen}(\hat m)+\rho_m+(P-P_n)(\ell(\hat g)-\ell(g_m)),
\end{aligned}
\tag{6.8}
\]

where the last inequality stems from the definition of $\hat g$.

Denoting $\hat m$ the model containing $\hat g$, we decompose the last term above:

\[
(P-P_n)(\ell(\hat g)-\ell(g_m))=(P-P_n)(\ell(\hat g)-\ell(u_{\hat m}))+(P-P_n)(\ell(u_{\hat m})-\ell(g_m)). \tag{6.9}
\]

We will bound both terms separately. For the first term, we use (6.7): for any
$m'\in\mathcal M$ and an arbitrary $K_{m'}>C_{m'}/7$, with probability at least $1-e^{-x_{m'}-\xi}$,
for all $g\in\mathcal G_{m'}$ we have

\[
(P-P_n)(\ell(g)-\ell(u_{m'}))\le K_{m'}^{-1}d^2(g,u_{m'})+50\frac{K_{m'}}{C_{m'}^2}r^*_{m'}+\frac{(K_{m'}+9b_{m'})(x_{m'}+\xi)}{n}. \tag{6.10}
\]

By the union bound, this inequality is valid simultaneously for all $m'\in\mathcal M$
with probability $1-e^{-\xi}$, so that it holds, in particular, for $m'=\hat m$, $g=\hat g$
with this probability. Finally, note that, for $g\in\mathcal G_{\hat m}$,

\[
d^2(g,u_{\hat m})\le\bigl(d(g,g^*)+d(u_{\hat m},g^*)\bigr)^2\le4d^2(g,g^*). \tag{6.11}
\]

For the second term of (6.9), we will use the following Bernstein inequality:
for any $m_1,m_2\in\mathcal M$, we have, with probability $1-\exp(-x_{m_1}-x_{m_2}-\xi)$,

\[
(P-P_n)(\ell(u_{m_1})-\ell(g_{m_2}))
\le\sqrt{2(x_{m_1}+x_{m_2}+\xi)\frac{\mathrm{Var}[\ell(u_{m_1})-\ell(g_{m_2})]}{n}}
+\frac{\max(b_{m_1},b_{m_2})(x_{m_1}+x_{m_2}+\xi)}{6n}. \tag{6.12}
\]

Now, using assumption (4.1), if $b_{m^*}=\max(b_{m_1},b_{m_2})$,

\[
\max(b_{m_1},b_{m_2})(x_{m_1}+x_{m_2})\le2b_{m^*}x_{m^*}\le2b_{m_1}x_{m_1}+2b_{m_2}x_{m_2}.
\]
We now deal with the first term of the bound (6.12): for any $g\in\mathcal G_{m_1}$,

\[
\begin{aligned}
\sqrt{2(x_{m_1}+x_{m_2}+\xi)\frac{\mathrm{Var}[\ell(u_{m_1})-\ell(g_{m_2})]}{n}}
&\le\sqrt{4\,\frac{x_{m_1}+x_{m_2}+\xi}{n}\bigl(d^2(u_{m_1},g^*)+d^2(g_{m_2},g^*)\bigr)}\\
&\le2\sqrt{\frac{(x_{m_1}+x_{m_2}+\xi)\,d^2(g,g^*)}{n}}+2\sqrt{\frac{(x_{m_1}+x_{m_2}+\xi)\,d^2(g_{m_2},g^*)}{n}}\\
&\le K_{m_1}^{-1}d^2(g,g^*)+K_{m_2}^{-1}d^2(g_{m_2},g^*)+(K_{m_1}+K_{m_2})\frac{x_{m_1}+x_{m_2}+\xi}{n},
\end{aligned}
\]

where the first inequality follows from hypothesis (H2) followed by the
triangle inequality, and the second from the definition of $u_{m_1}$. Anticipating
somewhat the end of the proof, we will choose $K_m=\alpha C_m$ for some fixed $\alpha$,
so that, using again assumption (4.1) like above, it is true that

\[
(K_{m_1}+K_{m_2})(x_{m_1}+x_{m_2})\le4K_{m_1}x_{m_1}+4K_{m_2}x_{m_2}.
\]

Therefore, (6.12) becomes, with probability $1-\exp(-x_{m_1}-x_{m_2}-\xi)$, for
any $g\in\mathcal G_{m_1}$,

\[
\begin{aligned}
(P-P_n)(\ell(u_{m_1})-\ell(g_{m_2}))\le{}&K_{m_1}^{-1}d^2(g,g^*)+K_{m_2}^{-1}d^2(g_{m_2},g^*)\\
&+\frac{(12K_{m_1}+b_{m_1})(x_{m_1}+\xi)}{3n}+\frac{(12K_{m_2}+b_{m_2})(x_{m_2}+\xi)}{3n}.
\end{aligned}
\tag{6.13}
\]

Bound (6.13) is therefore valid for all $m_1,m_2\in\mathcal M$ simultaneously with
probability $1-\exp(-\xi)$, and, in particular, for $m_1=\hat m$, $m_2=m$, $g=\hat g$.

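Putting together (6.9), (6.10), (6.11) and (6.13), we obtain that, with
probability $1-2\exp(-\xi)$, for all $m\in\mathcal M$,

\[
\begin{aligned}
(P-P_n)(\ell(\hat g)-\ell(g_m))\le{}&5K_{\hat m}^{-1}d^2(\hat g,g^*)+K_m^{-1}d^2(g_m,g^*)+50\frac{K_{\hat m}}{C_{\hat m}^2}r^*_{\hat m}\\
&+\frac{(15K_{\hat m}+28b_{\hat m})(x_{\hat m}+\xi)}{3n}+\frac{(12K_m+b_m)(x_m+\xi)}{3n}.
\end{aligned}
\tag{6.14}
\]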

Now choosing $K_m=5KC_m$ (note that we have $K_m>C_m/7$ as required,
since $K>1$), and replacing $\xi$ by $\xi+\log(2)$, recalling inequality (6.8) and
the hypothesis (4.2) on the penalty function, we thus obtain that, with
probability $1-\exp(-\xi)$, for any $m\in\mathcal M$,

\[
\begin{aligned}
L(\hat g,g^*)-L(g_m,g^*)\le{}&\mathrm{pen}(m)-\mathrm{pen}(\hat m)+C_{\hat m}^{-1}K^{-1}d^2(\hat g,g^*)+(5C_m)^{-1}K^{-1}d^2(g_m,g^*)\\
&+\mathrm{pen}(\hat m)+\mathrm{pen}(m)+\rho_m\\
\le{}&K^{-1}L(\hat g,g^*)+\tfrac15K^{-1}L(g_m,g^*)+2\,\mathrm{pen}(m)+\rho_m,
\end{aligned}
\]

using hypothesis (H3). This leads to the conclusion for the deviation
inequality of the model selection theorem.

For the inequality in expected risk, we go back to inequality (6.14), with
the choice $K_m=5KC_m$; also using (6.8), we conclude that, for any $\xi>0$,
the following inequality holds with probability $1-\exp(-\xi)$:

\[
\begin{aligned}
L(\hat g,g^*)-L(g_m,g^*)\le{}&K^{-1}L(\hat g,g^*)+\tfrac15K^{-1}L(g_m,g^*)+\mathrm{pen}(m)-\mathrm{pen}(\hat m)+\rho_m\\
&+\frac{250Kr^*_{\hat m}}{C_{\hat m}}+\frac{B_{\hat m}(x_{\hat m}+\xi+\log(2))}{3n}+\frac{B_m(x_m+\xi+\log(2))}{3n}.
\end{aligned}
\tag{6.15}
\]
The point is now to linearize the products $B_m\xi$. To do so, we use the
following Young's inequality valid for any positive $x,y$:

\[
xy\le\exp\Bigl(\frac x2\Bigr)+2y\log y,
\]

with $x=\xi$, $y=B_m$, so that, putting $u=\exp(\xi/2)$, and using the hypothesis
(4.4) on the penalty function, we obtain that, with probability $1-(u^{-2}\wedge1)$,

\[
L(\hat g,g^*)-L(g_m,g^*)\le K^{-1}L(\hat g,g^*)+\tfrac15K^{-1}L(g_m,g^*)+2\,\mathrm{pen}(m)+\rho_m+\frac un. \tag{6.16}
\]

Integrating concludes the proof. □ 
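
The Young-type inequality used to linearize the products can be probed numerically. The sketch below assumes the inequality reads $xy\le\exp(x/2)+2y\log y$ for positive $x,y$ and samples random pairs; the helper name `young_gap` and the sampling ranges are illustrative assumptions.

```python
import math
import random

def young_gap(x: float, y: float) -> float:
    """exp(x/2) + 2 y log(y) - x y; the inequality claims this is nonnegative."""
    return math.exp(x / 2.0) + 2.0 * y * math.log(y) - x * y

rng = random.Random(1)
# Smallest gap observed over random positive pairs (x, y).
min_gap = min(
    young_gap(rng.uniform(0.0, 30.0), rng.uniform(1e-3, 50.0))
    for _ in range(20000)
)
```

The inequality follows from the Fenchel-Young bound $ab\le e^a+b\log b-b$ applied to $a=x/2$, $b=2y$, which leaves a slack of at least $2y(1-\log2)$.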

6.3. Proof of Theorem 4.4. The purpose of Theorem 4.4 is to check that
conditions (H1) to (H4) of the general model selection Theorem 4.3 are
satisfied for settings (S1) and (S2) of the SVM. We will split the proofs into
several results corresponding to the different hypotheses.

Lemma 6.3. Under assumption (A1), hypothesis (H1) is satisfied for
$b_R=MR+1$.
Proof. We use the reproducing property of the kernel to conclude that

\[
\begin{aligned}
\forall g\in B(R)\qquad|\ell(yg(x))|&\le1+|g(x)|\\
&=1+|\langle g,k(x,\cdot)\rangle_k|\\
&\le1+\|g\|_k\|k(x,\cdot)\|_k\\
&=1+\|g\|_k\sqrt{k(x,x)}\le1+MR.
\end{aligned}
\]

□
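
The bound of Lemma 6.3 can be illustrated on a concrete kernel. The sketch below is a minimal example under stated assumptions: a Gaussian kernel on the real line (so that $\sup_x\sqrt{k(x,x)}=1$, i.e. $M=1$), and a function $g=\sum_i a_ik(c_i,\cdot)$ whose RKHS norm is $\sqrt{a^\top Ka}$. All names, centers and coefficients are hypothetical.

```python
import math
import random

def gaussian_kernel(s: float, t: float, width: float = 1.0) -> float:
    """Gaussian kernel; k(x, x) = 1 for every x, hence M = 1."""
    return math.exp(-((s - t) ** 2) / (2.0 * width ** 2))

def hinge(u: float) -> float:
    """Hinge loss applied to the margin u = y * g(x)."""
    return max(0.0, 1.0 - u)

rng = random.Random(0)
centers = [rng.uniform(-3.0, 3.0) for _ in range(8)]
coeffs = [rng.uniform(-2.0, 2.0) for _ in range(8)]

def g(x: float) -> float:
    """g = sum_i a_i k(c_i, .), an element of the RKHS."""
    return sum(a * gaussian_kernel(c, x) for a, c in zip(coeffs, centers))

# RKHS norm of g: sqrt(a^T K a), with K the Gram matrix of the centers.
norm_sq = sum(
    ai * aj * gaussian_kernel(ci, cj)
    for ai, ci in zip(coeffs, centers)
    for aj, cj in zip(coeffs, centers)
)
rkhs_norm = math.sqrt(max(norm_sq, 0.0))

M = 1.0
b_R = M * rkhs_norm + 1.0  # Lemma 6.3 with R = ||g||_k
max_loss = max(
    hinge(y * g(-5.0 + 0.05 * i)) for i in range(201) for y in (-1.0, 1.0)
)
```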

We now check conditions (H2) and (H3). This differs according to the set- 
ting, because we make a different choice for the pseudo-distance d depending 
on the setting considered. 

Lemma 6.4 [Setting (S1)]. Under assumptions (A1), (A2) and (A3),
conditions (H2) and (H3) of Theorem 4.3 are satisfied for the choice

\[
d_1^2(g,g')=E[(g-g')^2];\qquad C_R=2\Bigl(\frac{MR}{\eta_1}+\frac1{\eta_0}\Bigr).
\]

Proof. Obviously, (H2) is satisfied since $\ell$ is a Lipschitz function, so
that $|\ell(yg(x))-\ell(yg'(x))|\le|g(x)-g'(x)|$.

We will obtain the result we look for if we can bound uniformly in $x$ the
ratio

\[
\frac{E[(g-s^*)^2\mid X=x]}{E[\ell(g)-\ell(s^*)\mid X=x]}.
\]

Remember that, for $g\in B(R)$, the reproducing property of the kernel and
assumption (A1) imply $\|g\|_\infty\le M\|g\|_k\le MR$. Let us consider without loss
of generality the case $s^*=1$ (i.e., $\eta=P[Y=1\mid X=x]>\frac12$). We then have to
bound the ratio

\[
\frac{(1-g)^2}{\eta(1-g)_++(1-\eta)(1+g)_+-2(1-\eta)}.
\]

For $g\le-1$, this becomes $\frac{(1-g)^2}{\eta(1-g)-2(1-\eta)}$. Putting $x=-g-1\in[0,MR]$, this
can be rewritten as

\[
\frac{(x+2)^2}{\eta x+2(2\eta-1)}\le\frac{2x^2+8}{\eta x+2(2\eta-1)}\le4MR+\frac2{\eta_0}\le2\Bigl(\frac{MR}{\eta_1}+\frac1{\eta_0}\Bigr),
\]

where we have used the fact that $\eta\ge\frac12\ge\eta_1$. For $g\ge1$, this becomes $\frac{g-1}{1-\eta}$,
which is smaller than $(MR-1)/\eta_1$ for $g\in[1,MR]$. For $g\in[-1,1]$, the ratio
becomes $\frac{1-g}{2\eta-1}$, which is smaller than $1/\eta_0$. □
Lemma 6.5 [Setting (S2)]. Under assumptions (A1) and (A2), conditions
(H2) and (H3) of Theorem 4.3 are satisfied for the choice

\[
d_2^2(g,g')=E[(\ell(g)-\ell(g'))^2];\qquad C_R=MR+\frac1{\eta_0}.
\]

Proof. Obviously, (H2) is satisfied as before. We will obtain the result
we look for if we can bound uniformly in $x$ the ratio

\[
\frac{E[\ell(g)^2-2\ell(g)\ell(s^*)+\ell^2(s^*)\mid X=x]}{E[\ell(g)-\ell(s^*)\mid X=x]}.
\]

Notice first that

\[
E[\ell^2(s^*)\mid X=x]=2E[\ell(s^*)\mid X=x]=4\min(\eta(x),1-\eta(x)).
\]

Let us first consider the case $s^*=1$ (i.e., $\eta>\frac12$). The above ratio can be
written

\[
\frac{\eta(1-g)_+^2+(1-\eta)(1+g)_+\bigl((1+g)_+-4\bigr)+4(1-\eta)}{\eta(1-g)_++(1-\eta)(1+g)_+-2(1-\eta)}.
\]

For $g\le-1$, this becomes $\frac{\eta(1-g)^2+4(1-\eta)}{\eta(1-g)-2(1-\eta)}$; putting $x=-g-1\in[0,MR]$, this
can be written as

\[
\frac{\eta x^2+4\eta x+4}{\eta x+2(2\eta-1)}=x+\frac{4+2x}{\eta x+2(2\eta-1)}\le MR+\frac1{\eta_0}.
\]

For $g\ge1$, this becomes $\frac{(1-\eta)(g-1)^2}{(1-\eta)(g-1)}=g-1$, which is smaller than $MR-1$ for
$g\in[1,MR]$. For $g\in[-1,1]$, this becomes $\frac{1-g}{2\eta-1}$, which is smaller than $1/\eta_0$.
The case $\eta<\frac12$ can be treated in a similar way. □
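
The two ratio bounds of Lemmas 6.4 and 6.5 can be explored on a grid. The sketch below works in the case $s^*=1$ and relies on assumptions that are not spelled out in this section: (A2) is read as $|\eta-1/2|\ge\eta_0$ and (A3) as $\min(\eta,1-\eta)\ge\eta_1$, and the constants are taken as $C_R=2(MR/\eta_1+1/\eta_0)$ for (S1) and $C_R=MR+1/\eta_0$ for (S2). All function names and numerical values are hypothetical.

```python
def pos(u: float) -> float:
    """Positive part."""
    return max(0.0, u)

def excess_hinge(g: float, eta: float) -> float:
    """E[l(g) - l(s*) | X = x] when s* = 1 and P[Y = 1 | X = x] = eta."""
    return eta * pos(1.0 - g) + (1.0 - eta) * pos(1.0 + g) - 2.0 * (1.0 - eta)

def ratio_s1(g: float, eta: float) -> float:
    """Ratio bounded in Lemma 6.4 (squared L2 distance over excess loss)."""
    return (1.0 - g) ** 2 / excess_hinge(g, eta)

def ratio_s2(g: float, eta: float) -> float:
    """Ratio bounded in Lemma 6.5 (squared loss distance over excess loss)."""
    num = (eta * pos(1.0 - g) ** 2
           + (1.0 - eta) * pos(1.0 + g) * (pos(1.0 + g) - 4.0)
           + 4.0 * (1.0 - eta))
    return num / excess_hinge(g, eta)

M, R, eta0, eta1 = 1.0, 3.0, 0.1, 0.05
c_s1 = 2.0 * (M * R / eta1 + 1.0 / eta0)
c_s2 = M * R + 1.0 / eta0

max_s1 = max_s2 = 0.0
for i in range(601):
    g = -M * R + 0.01 * i
    if abs(1.0 - g) < 1e-9:
        continue  # the ratio is 0/0 at g = s* = 1
    for j in range(36):
        eta = 0.5 + eta0 + 0.01 * j  # eta in [1/2 + eta0, 1 - eta1]
        max_s1 = max(max_s1, ratio_s1(g, eta))
        max_s2 = max(max_s2, ratio_s2(g, eta))
```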

Finally, we check for hypothesis (H4); this condition characterizes the
complexity of the models and constitutes the meaty part of Theorem 4.4.
We start with the following result which deals with setting (S1). Here we
can see the (technical) reason why assumption (A3) was introduced in this
setting: to relate the penalty to the spectrum of the integral operator, we
use the $L_2$ distance $d_1$ as an intermediate pseudo-distance; but this requires,
in turn, assumption (A3) to check hypothesis (H3) (see Lemma 6.4 above).

Theorem 6.6. Let $\mathcal G$ be a RKHS with reproducing kernel $k$ such that
the associated integral operator $L_k$ has eigenvalues $(\lambda_j)$ (in nonincreasing
order). Let $\ell$ be the hinge loss function and denote $d_1^2(g,u)=P(g-u)^2$.
Then, for all $r>0$ and $u\in B(R)$,

\[
E\Bigl[\sup_{\substack{g\in B(R)\\ d_1^2(g,u)\le r}}|(P-P_n)(\ell(g)-\ell(u))|\Bigr]
\le\frac4{\sqrt n}\inf_{d\ge0}\Bigl(\sqrt{dr}+2R\sqrt{\sum_{j>d}\lambda_j}\Bigr):=\phi_R(r).
\]

The above $\phi_R$ is a sub-root function, and the unique solution of $\phi_R(r)=r/C_R$,
with $C_R\ge\eta_1^{-1}MR$, is upper bounded by

\[
r_R^*\le\frac{16C_R^2}{n}\inf_{d\ge0}\Bigl(d+\frac{\eta_1\sqrt n}{M}\sqrt{\sum_{j>d}\lambda_j}\Bigr).
\]

To prove Theorem 6.6, we will use two technical results; the first will
allow us to bound the quantity we are interested in by a localized Rademacher
complexity term; the second will give an upper bound on this term using
the assumptions.

We introduce the following notation for Rademacher averages: let $\sigma_1,\dots,\sigma_n$
be $n$ i.i.d. Rademacher random variables (i.e., such that $P[\sigma_i=1]=P[\sigma_i=-1]=\frac12$),
independent of $(X_i,Y_i)_{i=1}^n$; then we define for any measurable real-valued
function $f$ on $\mathcal X\times\mathcal Y$

\[
R_nf=n^{-1}\sum_{i=1}^n\sigma_if(X_i,Y_i). \tag{6.17}
\]

We then extend this notation to sets $\mathcal F$ of functions from $\mathcal X\times\mathcal Y$ to $\mathbb R$,
denoting

\[
R_n\mathcal F=\sup_{f\in\mathcal F}R_nf.
\]
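
To make the notation concrete, the following sketch estimates $E[R_n\mathcal F]$ by Monte Carlo for a small finite class. It is only an illustration: the sample, the class of functions $(x,y)\mapsto y\,w\,x$ and the number of sign draws are all hypothetical choices.

```python
import random

def rademacher_average(functions, sample, signs):
    """R_n F = sup over f in F of (1/n) * sum_i sigma_i f(X_i, Y_i)."""
    n = len(sample)
    return max(
        sum(s * f(z) for s, z in zip(signs, sample)) / n for f in functions
    )

rng = random.Random(0)
# Hypothetical sample of pairs (X_i, Y_i) with X_i in [-1, 1], Y_i in {-1, 1}.
sample = [(rng.uniform(-1.0, 1.0), rng.choice((-1, 1))) for _ in range(200)]
# Finite class f_w(x, y) = y * w * x; it contains the null function (w = 0).
function_class = [
    (lambda z, w=w: z[1] * w * z[0]) for w in (-1.0, -0.5, 0.0, 0.5, 1.0)
]

draws = []
for _ in range(500):
    signs = [rng.choice((-1, 1)) for _ in sample]
    draws.append(rademacher_average(function_class, sample, signs))
expected_rn = sum(draws) / len(draws)  # Monte Carlo estimate of E[R_n F]
```

Because the class contains the null function, every draw of $R_n\mathcal F$ is nonnegative, which is the property used in the proof of Lemma 6.7 to drop the positive part.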

We then have the following lemma: 



Lemma 6.7. Let $\mathcal F$ be a set of real functions; let $\phi$ be a 1-Lipschitz
function on $\mathbb R$. Then for $g_0\in\mathcal F$,

\[
E\Bigl[\sup_{g\in\mathcal F}|(P-P_n)(\phi\circ g-\phi\circ g_0)|\Bigr]\le4ER_n\{g-g_0:g\in\mathcal F\}.
\]



Proof. By a symmetrization argument, we have

\[
E\Bigl[\sup_{g\in\mathcal F}|(P-P_n)\phi\circ g-(P-P_n)\phi\circ g_0|\Bigr]
\le2E\Bigl[\sup_{g\in\mathcal F}|R_n(\phi\circ g-\phi\circ g_0)|\Bigr],
\]

and by symmetry of the Rademacher random variables, we have

\[
E\Bigl[\sup_{g\in\mathcal F}|R_n(\phi\circ g-\phi\circ g_0)|\Bigr]
\le2E\Bigl[\sup_{g\in\mathcal F}\bigl(R_n(\phi\circ g-\phi\circ g_0)\bigr)_+\Bigr].
\]

Since $g_0\in\mathcal F$, choosing $g=g_0$, one notices that

\[
E\Bigl[\sup_{g\in\mathcal F}\bigl(R_n(\phi\circ g-\phi\circ g_0)\bigr)_+\Bigr]
=E\Bigl[\sup_{g\in\mathcal F}R_n(\phi\circ g-\phi\circ g_0)\Bigr],
\]

and since $g_0$ is fixed, and $ER_n\phi\circ g_0=0$, we obtain

\[
E\Bigl[\sup_{g\in\mathcal F}|(P-P_n)\phi\circ g-(P-P_n)\phi\circ g_0|\Bigr]
\le4E\Bigl[\sup_{g\in\mathcal F}R_n(\phi\circ g)\Bigr].
\]

Since $\phi$ is 1-Lipschitz, we can finally apply the contraction principle for
Rademacher averages; then using $ER_ng_0=0$, we obtain the result. □

The next lemma gives a result similar to [25], but we provide a slightly 
different proof (also, we are not concerned about lower bounds here). The 
principle of the proof below can be traced back to the work of R. M. Dudley. 



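Lemma 6.8.

\[
ER_n\{g\in\mathcal H_k:\|g\|_k\le R,\ \|g\|_{2,P}^2\le r\}
\le\frac1{\sqrt n}\inf_{d\ge0}\Bigl(\sqrt{dr}+R\sqrt{\sum_{j>d}\lambda_j}\Bigr)
\le\Bigl(\frac2n\sum_{j\ge1}\min(r,R^2\lambda_j)\Bigr)^{1/2}.
\]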



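Proof. For $g\in\mathcal H_k$, by Lemma A.1 in the Appendix, we can decompose
$g$ as

\[
g=\sum_{i>0}\alpha_i\psi_i,
\]

with $\|g\|_{2,P}^2=\sum_{i>0}\lambda_i\alpha_i^2$ and $\|g\|_k^2=\sum_{i>0}\alpha_i^2$. The above series representation
holds as an equality in $\mathcal H_k$, and hence pointwise since the evaluation
functionals are continuous in a RKHS. Let us denote

\[
\Gamma(R,r)=\Bigl\{\alpha\in\ell^2:\ \|\alpha\|^2\le R^2,\ \sum_{i>0}\lambda_i\alpha_i^2\le r\Bigr\}.
\]

Thus, the quantity we try to upper bound is equal to

\[
E\Bigl[n^{-1}\sup_{\alpha\in\Gamma(R,r)}\sum_{i=1}^n\sigma_ig_\alpha(X_i)\Bigr],
\qquad\text{where}\qquad g_\alpha(x)=\sum_{j>0}\alpha_j\psi_j(x).
\]

We now write for any nonnegative integer $d$ and $\alpha\in\Gamma(R,r)$

\[
\sum_{i=1}^n\sigma_ig_\alpha(X_i)=\sum_{j>0}\alpha_j\sum_{i=1}^n\sigma_i\psi_j(X_i)
\le\Bigl|\sum_{j\le d}\alpha_j\sum_{i=1}^n\sigma_i\psi_j(X_i)\Bigr|+\Bigl|\sum_{j>d}\alpha_j\sum_{i=1}^n\sigma_i\psi_j(X_i)\Bigr|. \tag{6.18}
\]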



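Applying the Cauchy-Schwarz inequality to the second term, we have

\[
\Bigl|\sum_{j>d}\alpha_j\sum_{i=1}^n\sigma_i\psi_j(X_i)\Bigr|
\le\Bigl(\sum_{j>d}\alpha_j^2\Bigr)^{1/2}\Bigl(\sum_{j>d}\Bigl(\sum_{i=1}^n\sigma_i\psi_j(X_i)\Bigr)^2\Bigr)^{1/2}.
\]

We now take the expectation with respect to $(\sigma_i)$ and $(X_i)$ in succession.
We use the fact that the $(\sigma_i)$ are zero mean, uncorrelated, unit variance
variables; then the fact that $E[X^{1/2}]\le E[X]^{1/2}$, to obtain

\[
E_XE_\sigma\Bigl[\sup_{\alpha\in\Gamma(R,r)}\Bigl|\sum_{j>d}\alpha_j\sum_{i=1}^n\sigma_i\psi_j(X_i)\Bigr|\Bigr]
\le RE_X\Bigl[\Bigl(\sum_{j>d}\sum_{i=1}^n\psi_j^2(X_i)\Bigr)^{1/2}\Bigr]
\le\sqrt nR\Bigl(\sum_{j>d}\lambda_j\Bigr)^{1/2},
\]

where we have used the fact that $E_X[\psi_j^2(X)]=\lambda_j$. We now apply exactly
the same treatment to the first term of (6.18), except that we use weights
$(\lambda_j)$ in the Cauchy-Schwarz inequality, yielding

\[
E_XE_\sigma\Bigl[\sup_{\alpha\in\Gamma(R,r)}\Bigl|\sum_{j\le d}\alpha_j\sum_{i=1}^n\sigma_i\psi_j(X_i)\Bigr|\Bigr]
\le\sup_{\alpha\in\Gamma(R,r)}\Bigl(\sum_{j\le d}\lambda_j\alpha_j^2\Bigr)^{1/2}E_X\Bigl[\Bigl(\sum_{j\le d}\sum_{i=1}^n\lambda_j^{-1}\psi_j^2(X_i)\Bigr)^{1/2}\Bigr]
\le\sqrt{nrd}.
\]

This gives the first result. The second one follows from choosing $d$ such that,
for all $j>d$, $R^2\lambda_j\le r$, and using the inequality $\sqrt A+\sqrt B\le\sqrt2\sqrt{A+B}$.

□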
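
The two bounds of Lemma 6.8 can be compared numerically. The sketch below assumes they read $n^{-1/2}\inf_d(\sqrt{dr}+R\sqrt{\sum_{j>d}\lambda_j})$ and $(2n^{-1}\sum_j\min(r,R^2\lambda_j))^{1/2}$, and uses a hypothetical spectrum $\lambda_j=j^{-2}$ truncated at 200 terms (the tail beyond the truncation is treated as zero).

```python
import math

def tail_sums(eigs):
    """tails[d] = sum of eigs[d:], i.e. the sum over j > d (1-indexed)."""
    tails = [0.0] * (len(eigs) + 1)
    for d in range(len(eigs) - 1, -1, -1):
        tails[d] = tails[d + 1] + eigs[d]
    return tails

def first_bound(eigs, n, R, r):
    """(1/sqrt(n)) * min_d ( sqrt(d r) + R sqrt(sum_{j>d} lambda_j) )."""
    tails = tail_sums(eigs)
    best = min(
        math.sqrt(d * r) + R * math.sqrt(tails[d]) for d in range(len(eigs) + 1)
    )
    return best / math.sqrt(n)

def second_bound(eigs, n, R, r):
    """sqrt( (2/n) * sum_j min(r, R^2 lambda_j) )."""
    return math.sqrt(2.0 / n * sum(min(r, R * R * lam) for lam in eigs))

# Hypothetical polynomially decaying spectrum, nonincreasing as in the text.
eigs = [1.0 / j ** 2 for j in range(1, 201)]
n, R = 500, 2.0
pairs = [
    (first_bound(eigs, n, R, r), second_bound(eigs, n, R, r))
    for r in (1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0)
]
```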



Proof of Theorem 6.6. For $g$ a function $\mathcal X\to\mathbb R$, let us briefly introduce
the notation $\bar g:(x,y)\in\mathcal X\times\mathcal Y\mapsto yg(x)\in\mathbb R$. Let us apply Lemma 6.7
to $\bar{\mathcal F}_u=\{\bar g,\ g\in\mathcal F_u\}$, where $\mathcal F_u=\{g\in\mathcal H_k:\|g\|_k\le R,\ d_1^2(g,u)\le r\}$. The hinge
loss function $\ell$ is 1-Lipschitz, and $\bar u\in\bar{\mathcal F}_u$, hence,

\[
E\Bigl[\sup_{g\in\mathcal F_u}|(P-P_n)(\ell(g)-\ell(u))|\Bigr]
\le4ER_n\{\bar g-\bar u,\ g\in\mathcal F_u\}=4ER_n\{g-u,\ g\in\mathcal F_u\},
\]

where the last equality is true because of the symmetry of the Rademacher
variables. Notice that, since $\|u\|_k\le R$, we have

\[
\{g-u,\ g\in\mathcal F_u\}\subset\{g-u,\ \|g-u\|_k\le2R,\ d_1^2(g,u)\le r\};
\]

since $d_1^2(g,u)=E[(g-u)^2]$ is a norm-induced distance, we can replace $g-u$
by $g$ (by linearity) so that the above term can be upper bounded by

\[
4ER_n\{g\in\mathcal H_k:\|g\|_k\le2R,\ \|g\|_{2,P}^2\le r\}.
\]

Using Lemma 6.8, this can be further upper bounded by

\[
\frac4{\sqrt n}\inf_{d\ge0}\Bigl(\sqrt{dr}+2R\sqrt{\sum_{j>d}\lambda_j}\Bigr),
\]

which concludes the proof of the first part of the theorem.

Observe that the minimum of two sub-root functions is a sub-root function,
so that $\phi_R$ is a sub-root function. We now compute an upper bound
on the solution of the equation $\phi_R(r)=r/C$, which can be written

\[
r=\frac{4C}{\sqrt n}\inf_{d\ge0}\Bigl(\sqrt{dr}+2R\sqrt{\sum_{j>d}\lambda_j}\Bigr).
\]

Notice that the infimum is a minimum since the series $\sum_{j\ge1}\lambda_j$ is converging
and, thus, the value of the right-hand side is bounded for all $d$ and goes to
$\infty$ when $d\to\infty$. Let us then consider the particular value of $d$ where this
minimum is achieved. Solving the fixed point equation for this particular
value, we have

\[
r^*=\frac{4C^2}{n}\Bigl(\sqrt d+\sqrt{d+8\sqrt nR(4C)^{-1}\sqrt{\sum_{j>d}\lambda_j}}\Bigr)^2.
\]

Now for any other value $d'\ne d$, $r^*$ satisfies

\[
r^*\le\frac{4C}{\sqrt n}\Bigl(\sqrt{d'r^*}+2R\sqrt{\sum_{j>d'}\lambda_j}\Bigr),
\]

which means that $r^*$ is smaller than the largest solution of the corresponding
equality. As a result, we have

\[
r^*=\inf_{d\ge0}\frac{4C^2}{n}\Bigl(\sqrt d+\sqrt{d+8\sqrt nR(4C)^{-1}\sqrt{\sum_{j>d}\lambda_j}}\Bigr)^2.
\]

Using $(a+b)^2\le2(a^2+b^2)$, putting $C=C_R$ and finally using the assumption
$RC_R^{-1}\le\eta_1M^{-1}$ yields the result. □
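
The fixed point computation can be reproduced numerically. The sketch below assumes $\phi_R(r)=\frac4{\sqrt n}\min_d(\sqrt{dr}+2R\sqrt{\sum_{j>d}\lambda_j})$ and the closed form $r^*=\min_d\frac{4C^2}{n}(\sqrt d+\sqrt{d+2\sqrt nRC^{-1}\sqrt{\sum_{j>d}\lambda_j}})^2$; it checks that $\phi_R(r^*)=r^*/C$ and that $r^*$ lies below $\frac{16C^2}{n}\min_d(d+\sqrt nRC^{-1}\sqrt{\sum_{j>d}\lambda_j})$. The spectrum $\lambda_j=j^{-2}$ (truncated at 200 terms) and all numerical values are hypothetical.

```python
import math

def tail_roots(eigs):
    """roots[d] = sqrt(sum_{j>d} lambda_j) for d = 0..len(eigs)."""
    tails = [0.0] * (len(eigs) + 1)
    for d in range(len(eigs) - 1, -1, -1):
        tails[d] = tails[d + 1] + eigs[d]
    return [math.sqrt(t) for t in tails]

def phi(r, roots, n, R):
    """Sub-root function phi_R(r) of Theorem 6.6."""
    return 4.0 / math.sqrt(n) * min(
        math.sqrt(d * r) + 2.0 * R * s for d, s in enumerate(roots)
    )

def r_star_closed_form(roots, n, R, C):
    """Smallest per-d fixed point, i.e. the solution of phi_R(r) = r / C."""
    return min(
        4.0 * C ** 2 / n
        * (math.sqrt(d) + math.sqrt(d + 2.0 * math.sqrt(n) * R * s / C)) ** 2
        for d, s in enumerate(roots)
    )

def r_star_upper_bound(roots, n, R, C):
    """Bound obtained from (a + b)^2 <= 2 (a^2 + b^2)."""
    return 16.0 * C ** 2 / n * min(
        d + math.sqrt(n) * R * s / C for d, s in enumerate(roots)
    )

eigs = [1.0 / j ** 2 for j in range(1, 201)]  # hypothetical spectrum
roots = tail_roots(eigs)
n, R, C = 1000, 2.0, 5.0
r_star = r_star_closed_form(roots, n, R, C)
residual = phi(r_star, roots, n, R) - r_star / C
upper = r_star_upper_bound(roots, n, R, C)
```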
This concludes the proof of Theorem 4.4 for setting (S1). We finally turn
to checking hypothesis (H4) in setting (S2): in this case we use a classical
entropy chaining argument.

Theorem 6.9. Under assumption (A1) and the notation of setting (S2),
we have

\[
E\Bigl[\sup_{\substack{\|g\|_k\le R\\ d_2^2(g,g_0)\le r}}|(P-P_n)(\ell(g)-\ell(g_0))|\Bigr]
\le\frac{48R}{\sqrt n}\,\xi\Bigl(\frac{\sqrt r}{2R}\Bigr)+\frac{8MR^3}{nr}\,\xi^2\Bigl(\frac{\sqrt r}{2R}\Bigr):=\psi_R(r),
\]

where the function $\xi$ is defined as in (3.2). The function $\psi_R$ is sub-root; if $x_*$
denotes the solution of the equation $\xi(x)=M^{-1}\sqrt n\,x^2$, then the solution
$r_R^*$ of the equation $\psi_R(r)=C_R^{-1}r$, with $C_R\ge MR$, satisfies

\[
r_R^*\le2500M^{-2}C_R^2x_*^2.
\]

The chaining technique used for proving this theorem is summed up in 
the next lemma, for which we give a proof for completeness. 



Lemma 6.10. Let $\mathcal F$ be a class of real functions which is separable in
the supremum norm, containing the null function, and such that every $f\in\mathcal F$
satisfies $\|f\|_\infty\le M$ and $E[f^2]\le\sigma^2$. Denote $H_\infty(\varepsilon)$ the supremum norm
$\varepsilon$-entropy for $\mathcal F$. Then it holds that

\[
E\Bigl[\sup_{f\in\mathcal F}|(P-P_n)f|\Bigr]\le\frac{24}{\sqrt n}\int_0^\sigma\sqrt{H_\infty(\varepsilon)}\,d\varepsilon+\frac{MH_\infty(\sigma)}{n}. \tag{6.19}
\]



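Proof. It is a well-known consequence of Hoeffding's (resp. Bernstein's)
inequality that a finite class of functions $\mathcal G$ bounded by $M$ in absolute value
satisfies

\[
E\Bigl[\sup_{g\in\mathcal G}(P-P_n)g\Bigr]\le\sqrt{\frac{2M^2\log(|\mathcal G|)}{n}}; \tag{6.20}
\]

respectively, if, additionally, it holds that $E[g^2]\le\sigma^2$ for all $g\in\mathcal G$, we have

\[
E\Bigl[\sup_{g\in\mathcal G}(P-P_n)g\Bigr]\le\sqrt{\frac{2\sigma^2\log(|\mathcal G|)}{n}}+\frac{M\log(|\mathcal G|)}{3n}. \tag{6.21}
\]

Since $\mathcal F$ contains the null function, it is clear that

\[
E\Bigl[\sup_{f\in\mathcal F}|(P-P_n)f|\Bigr]\le E\Bigl[\sup_{f\in\mathcal F}(P-P_n)f\Bigr]+E\Bigl[\sup_{f\in\mathcal F}(P_n-P)f\Bigr]. \tag{6.22}
\]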
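Since we have assumed that $\mathcal F$ is separable for the sup norm, it is sufficient
to prove (6.19) for any finite subset of $\mathcal F$. Without loss of generality, we
therefore assume that $\mathcal F$ is finite. Put $\delta_i=\sigma2^{-i}$ and let, for any $f\in\mathcal F$, $\Pi_if$
be a member of a $\delta_i$-supremum norm cover of $\mathcal F$ such that $\|\Pi_if-f\|_\infty\le\delta_i$.
We write

\[
E\Bigl[\sup_{f\in\mathcal F}(P-P_n)f\Bigr]\le E\Bigl[\sup_{f\in\mathcal F}(P-P_n)\Pi_0f\Bigr]+\sum_{i>0}E\Bigl[\sup_{f\in\mathcal F}(P-P_n)(\Pi_if-\Pi_{i-1}f)\Bigr].
\]

We now apply (6.21) to the first term of the above bound and (6.20) to all
of the other terms. More precisely, we apply (6.21) to the class $\{\Pi_0f,\ f\in\mathcal F\}$
which has cardinality bounded by $\exp(H_\infty(\delta_0))$, and respectively (6.20) to
the classes $\{(\Pi_if-\Pi_{i-1}f),\ f\in\mathcal F\}$ which have their respective cardinality
bounded by $\exp(2H_\infty(\delta_i))$. We then have

\[
\begin{aligned}
E\Bigl[\sup_{f\in\mathcal F}(P-P_n)f\Bigr]
&\le\sqrt{\frac{2\sigma^2H_\infty(\delta_0)}{n}}+\frac{MH_\infty(\delta_0)}{3n}+\sum_{i>0}\sqrt{\frac{36\delta_i^2H_\infty(\delta_i)}{n}}\\
&\le\frac{12}{\sqrt n}\int_0^\sigma\sqrt{H_\infty(\varepsilon)}\,d\varepsilon+\frac{MH_\infty(\sigma)}{3n}.
\end{aligned}
\]

We apply the same inequality to the class $-\mathcal F$ and conclude using (6.22).

□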
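
The passage from the chaining sum to the entropy integral can be illustrated on an example. The sketch below assumes a hypothetical entropy $H_\infty(\varepsilon)=1/\varepsilon$ and scales $\delta_i=\sigma2^{-i}$, and compares $\sqrt2\,\sigma\sqrt{H_\infty(\sigma)}+\sum_{i>0}6\delta_i\sqrt{H_\infty(\delta_i)}$ (the chaining terms up to the common factor $n^{-1/2}$) with $12\int_0^\sigma\sqrt{H_\infty(\varepsilon)}\,d\varepsilon=24\sqrt\sigma$.

```python
import math

def entropy(eps: float) -> float:
    """Hypothetical sup-norm entropy H(eps) = 1 / eps (an assumption)."""
    return 1.0 / eps

sigma = 0.5

# Chaining terms, up to the common factor 1 / sqrt(n).
chaining_sum = math.sqrt(2.0) * sigma * math.sqrt(entropy(sigma)) + sum(
    6.0 * (sigma * 2.0 ** -i) * math.sqrt(entropy(sigma * 2.0 ** -i))
    for i in range(1, 60)
)

# 12 * integral_0^sigma eps^{-1/2} d(eps) = 24 * sqrt(sigma) in closed form.
dudley_bound = 12.0 * 2.0 * math.sqrt(sigma)
```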



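Proof of Theorem 6.9. We want to apply Lemma 6.10 to the class
of functions

\[
\mathcal F_{g_0}=\{\ell(g)-\ell(g_0):g\in\mathcal H_k;\ \|g\|_k\le R;\ E[(\ell(g)-\ell(g_0))^2]\le r\}.
\]

Similarly to the reasoning used in the proof of Theorem 6.6, it is clear that

\[
\mathcal F_{g_0}\subset\mathcal F(2R,r)=\{\ell(g);\ g\in\mathcal H_k;\ \|g\|_k\le2R;\ E[\ell(g)^2]\le r\}.
\]

Because the loss function $\ell$ is 1-Lipschitz, it holds that $\|\ell(f)-\ell(g)\|_\infty\le\|f-g\|_\infty$,
hence, $H_\infty(\mathcal F(2R,r),\varepsilon)\le H_\infty(B_{\mathcal H_k}(2R),\varepsilon)$. Applying Lemma 6.10
therefore yields

\[
\begin{aligned}
E\Bigl[\sup_{\substack{\|g\|_k\le R\\ d_2^2(g,g_0)\le r}}|(P-P_n)(\ell(g)-\ell(g_0))|\Bigr]
&\le\frac{24}{\sqrt n}\int_0^{\sqrt r}\sqrt{H_\infty(2RB_{\mathcal H_k},\varepsilon)}\,d\varepsilon+\frac{2MRH_\infty(2RB_{\mathcal H_k},\sqrt r)}{n}\\
&=\frac{48R}{\sqrt n}\int_0^{\sqrt r/2R}\sqrt{H_\infty(B_{\mathcal H_k},\varepsilon)}\,d\varepsilon+\frac{2MRH_\infty(B_{\mathcal H_k},\sqrt r/2R)}{n}\\
&\le\frac{48R}{\sqrt n}\,\xi\Bigl(\frac{\sqrt r}{2R}\Bigr)+\frac{8MR^3}{nr}\,\xi^2\Bigl(\frac{\sqrt r}{2R}\Bigr),
\end{aligned}
\]

where $\xi$ is defined as in (3.2), and the last inequality comes from the observation
that $\xi(x)\ge x\sqrt{H_\infty(B_{\mathcal H_k},x)}$. The function $\psi_R$ is obviously sub-root
since $H_\infty(B_{\mathcal H_k},\varepsilon)$ is a decreasing function of $\varepsilon$.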

Denote $x_*$ the solution of the equation $\xi(x)=M^{-1}\sqrt n\,x^2$; we claim that,
for a suitable choice of constant $c$, $t_R^*=c^2M^{-2}C_R^2x_*^2$ is an upper bound for
the solution $r_R^*$ of the equation $\psi_R(r)=C_R^{-1}r$ entering in hypothesis (H4).
This is implied by the relation $\psi_R(t_R^*)\le C_R^{-1}t_R^*$ which we now prove.

Note that $\frac{\sqrt{t_R^*}}{2R}=c\frac{C_R}{2RM}x_*$, and that $\frac{C_R}{RM}\ge1$. Since $x^{-1}\xi(x)$ is a decreasing
function, assuming $c\ge2$, it holds that

\[
\xi\Bigl(\frac{\sqrt{t_R^*}}{2R}\Bigr)\le c\frac{C_R}{2RM}\,\xi(x_*)=c\frac{C_R}{2RM^2}\sqrt n\,x_*^2\le\frac{\sqrt n\,t_R^*}{cRC_R}.
\]

Plugging this into the expression for $\psi_R$ yields

\[
\psi_R(t_R^*)\le\Bigl(\frac{48}c+\frac{8MR}{c^2C_R}\Bigr)\frac{t_R^*}{C_R}\le\Bigl(\frac{48}c+\frac8{c^2}\Bigr)\frac{t_R^*}{C_R},
\]

where we have used again the relation $MR\le C_R$. The choice $c=50$ implies
the desired relation. □
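
The conclusion of Theorem 6.9 can be checked on a toy entropy integral. The sketch below assumes a hypothetical $\xi(x)=x^{1/2}$ (which is increasing, concave and such that $x^{-1}\xi(x)$ decreases), takes $\psi_R(r)=\frac{48R}{\sqrt n}\xi(\frac{\sqrt r}{2R})+\frac{8MR^3}{nr}\xi^2(\frac{\sqrt r}{2R})$, and verifies $\psi_R(t)\le t/C_R$ at $t=2500M^{-2}C_R^2x_*^2$ for a few parameter choices with $C_R\ge MR$. All names and values are illustrative.

```python
import math

A = 0.5  # exponent of the hypothetical entropy integral xi(x) = x ** A

def xi(x: float) -> float:
    return x ** A

def psi(r: float, n: int, M: float, R: float) -> float:
    """psi_R(r) as in the statement of Theorem 6.9 (assumed form)."""
    arg = math.sqrt(r) / (2.0 * R)
    return (48.0 * R / math.sqrt(n) * xi(arg)
            + 8.0 * M * R ** 3 / (n * r) * xi(arg) ** 2)

def x_star(n: int, M: float) -> float:
    """Solution of xi(x) = sqrt(n) x^2 / M when xi(x) = x ** A."""
    return (M / math.sqrt(n)) ** (1.0 / (2.0 - A))

def r_bound(n: int, M: float, C: float) -> float:
    """Claimed upper bound 2500 M^{-2} C^2 x_*^2 on the fixed point."""
    return 2500.0 * C ** 2 * x_star(n, M) ** 2 / M ** 2

# (n, M, R, C) with C >= M * R in every case.
cases = [(1000, 1.0, 2.0, 3.0), (100, 2.0, 1.0, 2.0), (10000, 0.5, 4.0, 2.0)]
checks = [
    (psi(r_bound(n, M, C), n, M, R), r_bound(n, M, C) / C)
    for n, M, R, C in cases
]
constant_for_c_50 = 48.0 / 50.0 + 8.0 / 50.0 ** 2
```

With $c=50$ the constant $48/c+8/c^2$ equals $0.9632$, which is why this value of $c$ suffices in the last step.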

6.4. Proof of Theorem 3.1. Theorem 4.4 states that the conditions (H1)-(H4)
of the model selection theorem (Theorem 4.3) are satisfied for the
family of models $B(R),R\in\mathcal R$ and some explicit values for $b_R,C_R,\phi_R$ and
$r_R^*$ [depending on the considered setting (S1) or (S2)]. Let us choose an
appropriate finite set $\mathcal R$ and a sequence $(x_R)_{R\in\mathcal R}$ so that we can approximate
the minimization over all $R>0$ in equation (3.4) by a minimization over the
finite set of radii $\mathcal R$.

We consider the set of discretized radii

\[
\mathcal R=\{M^{-1}2^k,\ k\in\mathbb N,\ 0\le k\le\lceil\log_2n\rceil\}.
\]

The cardinality of $\mathcal R$ is then $\lceil\log_2n\rceil+1$ and we consequently choose $x_R=\log(\log_2n+2)$
for all $R\in\mathcal R$ which satisfies $\sum_{R\in\mathcal R}e^{-x_R}\le1$.
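
The discretization is easy to write down explicitly. The sketch below builds the grid $\{M^{-1}2^k,\ 0\le k\le\lceil\log_2n\rceil\}$ and the common weight $x_R=\log(\log_2n+2)$, then evaluates $\sum_Re^{-x_R}$; the function names and the values $n=1000$, $M=2$ are illustrative assumptions.

```python
import math

def radius_grid(n: int, M: float):
    """Discretized radii M^{-1} 2^k for 0 <= k <= ceil(log2 n)."""
    k_max = math.ceil(math.log2(n))
    return [2.0 ** k / M for k in range(k_max + 1)]

def weight(n: int) -> float:
    """Common weight x_R = log(log2(n) + 2) assigned to every radius."""
    return math.log(math.log2(n) + 2.0)

n, M = 1000, 2.0
grid = radius_grid(n, M)
total = len(grid) * math.exp(-weight(n))  # sum over R of exp(-x_R)
```

Since the grid has $\lceil\log_2n\rceil+1\le\log_2n+2$ elements, the sum is at most 1, and the largest radius is at least $M^{-1}n$.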

In order to apply Theorem 4.3, the penalty function should satisfy equation
(4.2). A sufficient condition on the penalty function for the family of
models $\{B(R),R\in\mathcal R\}$ is therefore

\[
\mathrm{pen}(R)\ge c_1\Bigl(\frac{r_R^*}{C_R}+\frac{(C_R+b_R)(\log(\delta^{-1})+\log\log n)}{n}\Bigr),
\]

where $c_1$ is a suitable constant, and we picked $K=3$ in equation (4.2).

Recalling the definition of $\gamma(n)$ in settings (S1) and (S2), the requirement
(3.3) on $\Lambda_n$ and the definition of $w_1$ in Theorem 3.1, it can be checked




by elementary manipulations that the above condition on the penalty is
satisfied in both settings for

\[
\mathrm{pen}(R)=\Lambda_n\bigl(\varphi(MR/2)+w_1^{-1}\eta_0^{-1}\bigr),
\]

up to a suitable choice of the constant $c$ in (3.3); note that we can assume
$c\ge1$.

The last step to be analyzed now is how to go back and forth between
the discretized framework $R\in\mathcal R$ and the continuous framework to obtain
the final result. To apply the model selection theorem, we will interpret the
continuous regularization defining $\hat g$ as an approximate discretized penalized
minimization over the above family of models using the penalty function
defined above.

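In view of definition (3.4) of the estimator $\hat g$, the following upper bound
holds:

\[
P_n\ell(\hat g)+\Lambda_n\varphi(M\|\hat g\|_k)\le P_n\ell(0)+\Lambda_n\varphi(0)=1,
\]

which implies $1\ge\Lambda_n\varphi(M\|\hat g\|_k)$. Since we have assumed $c\ge1$ in (3.3), we
have $\Lambda_n\ge n^{-1}$; this implies $\|\hat g\|_k\le M^{-1}n$ (using the assumption on $\varphi$).
Denote $\hat R=M^{-1}2^{\hat k}$ where $\hat k=\lceil(\log_2(M\|\hat g\|_k))_+\rceil$. The fact that $\|\hat g\|_k\le M^{-1}n$
implies $\hat R\in\mathcal R$. Note that $\hat g\in B(\hat R)$ and that $\hat R\le2M^{-1}\max(M\|\hat g\|_k,1)$. This
entails

\[
\begin{aligned}
P_n\ell(\hat g)+\mathrm{pen}(\hat R)
&\le P_n\ell(\hat g)+\Lambda_n\varphi(\max(M\|\hat g\|_k,1))+w_1^{-1}\eta_0^{-1}\Lambda_n\\
&\le P_n\ell(\hat g)+\Lambda_n\varphi(M\|\hat g\|_k)+\Lambda_n\varphi(1)+w_1^{-1}\eta_0^{-1}\Lambda_n\\
&\le\inf_{g\in\mathcal H_k}\bigl[P_n\ell(g)+\Lambda_n\varphi(M\|g\|_k)\bigr]+\Lambda_n\varphi(1)+w_1^{-1}\eta_0^{-1}\Lambda_n\\
&=\inf_{R>0}\inf_{g\in B(R)}\bigl[P_n\ell(g)+\Lambda_n(\varphi(MR)+\varphi(1)+w_1^{-1}\eta_0^{-1})\bigr]\\
&\le\inf_{R\in\mathcal R}\inf_{g\in B(R)}\bigl[P_n\ell(g)+\Lambda_n(\varphi(MR)+\varphi(1)+w_1^{-1}\eta_0^{-1})\bigr],
\end{aligned}
\]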

where the first inequality follows from the definition of $\mathrm{pen}(\hat R)$, and the third
from the definition of $\hat g$. So if we put $\rho_R=\Lambda_n(\varphi(MR)+\varphi(1)+w_1^{-1}\eta_0^{-1})-\mathrm{pen}(R)\ge0$,
we just proved that $\hat g$ is a $(\rho_R)$-approximate penalized minimum
loss estimator over the family $(B(R))_{R\in\mathcal R}$. Now applying the model selection
theorem (Theorem 4.3), we conclude that the following bound holds with
probability at least $1-\delta$:

\[
\begin{aligned}
L(\hat g,s^*)
&\le2\inf_{R\in\mathcal R}\inf_{g\in B(R)}\bigl[L(g,s^*)+2\Lambda_n(\varphi(MR)+\varphi(1)+w_1^{-1}\eta_0^{-1})\bigr]\\
&=2\inf_{R\in\mathcal R}\inf_{g\in B(R)}\bigl[L(g,s^*)+2\Lambda_n\varphi(2^{\log_2MR})\bigr]+4\Lambda_n(\varphi(1)+w_1^{-1}\eta_0^{-1})\\
&\le2\inf_{R\le nM^{-1}}\inf_{g\in B(R)}\bigl[L(g,s^*)+2\Lambda_n\varphi(2^{\lceil(\log_2MR)_+\rceil})\bigr]+4\Lambda_n(\varphi(1)+w_1^{-1}\eta_0^{-1})\\
&\le2\inf_{R\le nM^{-1}}\inf_{g\in B(R)}\bigl[L(g,s^*)+2\Lambda_n\varphi(2^{(\log_2MR)_++1})\bigr]+4\Lambda_n(\varphi(1)+w_1^{-1}\eta_0^{-1})\\
&\le2\inf_{R\le nM^{-1}}\inf_{g\in B(R)}\bigl[L(g,s^*)+2\Lambda_n\varphi(2(MR\vee1))\bigr]+4\Lambda_n(\varphi(1)+w_1^{-1}\eta_0^{-1})\\
&\le2\inf_{g\in\mathcal H_k}\bigl[L(g,s^*)+2\Lambda_n\varphi(2M\|g\|_k)\bigr]+4\Lambda_n(2\varphi(2)+w_1^{-1}\eta_0^{-1}).
\end{aligned}
\]

The last inequality holds because, if we denote $g^*$ the minimizer of the last
infimum, comparing it with the constant null function (as for $\hat g$ earlier),
we conclude that $2\Lambda_n\varphi(2M\|g^*\|_k)\le1$, implying $\|g^*\|_k\le M^{-1}n$, so that the
restriction $R\le M^{-1}n$ in the previous infimum can be dropped.

APPENDIX A: PROPERTIES OF THE KERNEL INTEGRAL OPERATOR

In this appendix, we sum up a few useful properties of the integral operator
$L_k$ introduced in (3.1). These are used in the proof of Lemma 6.8.
While these results are certainly not new, we provide a self-contained proof
for completeness.

Lemma A.1. Let $\mathcal H_k$ be a separable RKHS with kernel $k$ on a measurable
space $\mathcal X$. Assume $y\mapsto k(x,y)$ is measurable for any fixed $x\in\mathcal X$. Then the
function $x\mapsto k(x,\cdot)\in\mathcal H_k$ is measurable; in particular, $(x,y)\mapsto k(x,y)$ is
jointly measurable.

Let $P$ be a probability distribution on $\mathcal X$; assume $L_2(P)$ is separable and
$E_{X\sim P}[k(X,X)]<\infty$.

Then $\mathcal H_k\subset L_2(P)$ and the canonical inclusion $T:\mathcal H_k\to L_2(P)$ is continuous.

The integral operator $L_k:L_2(P)\to L_2(P)$ defined as

\[
(L_kf)(x)=\int k(x,y)f(y)\,dP(y)
\]

is well defined, positive, self-adjoint and trace class; moreover, $L_k=TT^*$. In
particular, if $(\lambda_i)_{i>0}$ denote its eigenvalues, repeated with their multiplicities,
$\sum_{i>0}\lambda_i<\infty$.

Finally, there exists an orthonormal basis $(\psi_i)_{i>0}$ of $\mathcal H_k$ such that, for
any $f\in\mathcal H_k$,

\[
\|f\|_{2,P}^2=\sum_{i>0}\lambda_i\langle f,\psi_i\rangle_{\mathcal H_k}^2.
\]
Proof. Let us first prove that any function $f\in\mathcal H_k$ is measurable. By
assumption, for any fixed $x$, $k(x,\cdot)$ is measurable; hence, also any finite
linear combination $\sum_ia_ik(x_i,\cdot)$. Any function $f\in\mathcal H_k$ is the limit in $\mathcal H_k$
of a sequence of such linear combinations. By the reproducing property,
a sequence converging in $\mathcal H_k$ also converges pointwise, since $\langle f_i,k(x,\cdot)\rangle=f_i(x)$.
Hence, $f$ is measurable. Now we prove that $x\mapsto K(x)=k(x,\cdot)\in\mathcal H_k$
is measurable.

For any $f\in\mathcal H_k$, $x\mapsto\langle k(x,\cdot),f\rangle=f(x)$ is measurable, hence, the inverse
image of a half-space by $K$ is measurable. Since $\mathcal H_k$ is separable, any open
set is a countable union of open balls (Lindelöf property); and any ball in
$\mathcal H_k$ is a countable intersection of half-spaces. Hence, $K$ is measurable. This
implies that $k(x,y)=\langle k(x,\cdot),k(y,\cdot)\rangle_{\mathcal H_k}$ is jointly measurable.

By the Cauchy-Schwarz inequality, we further have $|k(x,y)|^2\le k(x,x)k(y,y)$,
so that the assumption that $k(x,x)\in L_1(P)$ implies that $k(\cdot,\cdot)\in L_2(\mathcal X\times\mathcal X,P\otimes P)$.
This ensures that $L_k$ is well defined [as an operator $L_2(P)\to L_2(P)$]
and Hilbert-Schmidt, hence, compact. Moreover, by symmetry of
$k$, $L_k$ is self-adjoint. As $L_2(P)$ is separable, $L_k$ can be diagonalized in an
orthonormal basis $(\phi_i)_{i>0}$ of $L_2(P)$, where $L_k\phi_i=\lambda_i\phi_i$.

Consider now the canonical inclusion $T$ from the reproducing kernel Hilbert
space $\mathcal H_k$ into $L_2(P)$. For $f\in\mathcal H_k$, we have

\[
\int f^2(x)\,dP(x)=\int\langle f,k(x,\cdot)\rangle_{\mathcal H_k}^2\,dP(x)\le\|f\|_{\mathcal H_k}^2\int k(x,x)\,dP(x).
\]

This proves that $T$ is well defined and continuous on $\mathcal H_k$. Let $T^*:L_2(P)\to\mathcal H_k$
denote its adjoint.

For any $f\in L_2(P)$, we have by definition for all $x\in\mathcal X$, $T^*f(x)=\langle k(x,\cdot),T^*f\rangle_{\mathcal H_k}
=\langle Tk(x,\cdot),f\rangle_{L_2(P)}=(L_kf)(x)$. Hence, $TT^*=L_k$. In particular, $\lambda_i=
\langle\phi_i,\lambda_i\phi_i\rangle_{L_2(P)}=\langle T^*\phi_i,T^*\phi_i\rangle_{\mathcal H_k}\ge0$, which proves that $L_k$ is a positive
operator.

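Now let us consider the operator $C=T^*T:\mathcal H_k\to\mathcal H_k$. It is bounded, positive
and self-adjoint. Let $(\psi_i)_{i>0}$ be an orthonormal basis of $\mathcal H_k$. We have

\[
k(x,x)=\langle k(x,\cdot),k(x,\cdot)\rangle=\sum_{i>0}\langle k(x,\cdot),\psi_i\rangle^2=\sum_{i>0}\psi_i(x)^2
\]

and

\[
\operatorname{tr}C=\sum_{i>0}\langle\psi_i,T^*T\psi_i\rangle=\sum_{i>0}\int\psi_i^2(x)\,dP(x)=\int k(x,x)\,dP(x)<\infty
\]

by monotone convergence. This proves that $C$ is trace-class. Now since $TT^*$
and $T^*T$ have the same nonzero eigenvalues (with identical multiplicities),
and $\operatorname{tr}C=\sum_{i>0}\lambda_i<\infty$, $L_k$ is also trace-class.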
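
A finite-sample analogue may help fix ideas. When $P$ is replaced by the empirical measure on $n$ points, $L_k$ acts on vectors of function values as the matrix $K/n$, with $K$ the Gram matrix, so its trace is the average of $k(x_i,x_i)$. The sketch below assumes a Gaussian kernel and hypothetical sample points; it checks the trace identity, positivity of the quadratic form and the Cauchy-Schwarz bound $|k(x,y)|^2\le k(x,x)k(y,y)$.

```python
import math
import random

def k(s: float, t: float) -> float:
    """Gaussian kernel (an illustrative choice); k(x, x) = 1."""
    return math.exp(-((s - t) ** 2) / 2.0)

rng = random.Random(0)
points = [rng.gauss(0.0, 1.0) for _ in range(30)]
n = len(points)
gram = [[k(s, t) for t in points] for s in points]

# Trace of the empirical integral operator K / n, i.e. E_n[k(X, X)].
trace = sum(gram[i][i] for i in range(n)) / n

def quad_form(v):
    """<f, L_k f> in L2(P_n) for f with values v: v^T K v / n^2."""
    return sum(
        v[i] * gram[i][j] * v[j] for i in range(n) for j in range(n)
    ) / n ** 2

min_quad = min(
    quad_form([rng.uniform(-1.0, 1.0) for _ in range(n)]) for _ in range(200)
)
cauchy_schwarz_ok = all(
    gram[i][j] ** 2 <= gram[i][i] * gram[j][j] + 1e-12
    for i in range(n) for j in range(n)
)
```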
We can actually choose $(\psi_i)$ as an orthonormal basis of eigenvectors of $C$
with corresponding eigenvalues $\lambda_i$. In that case, we can write any function
$f\in\mathcal H_k$ as

\[
f=\sum_{i>0}\langle f,\psi_i\rangle\psi_i,
\]

where $\|f\|_{\mathcal H_k}^2=\sum_{i>0}\langle f,\psi_i\rangle^2$ and by continuity of $T$,

\[
Tf=\sum_{i>0}\langle f,\psi_i\rangle T\psi_i.
\]

Now, since $C\psi_i=\lambda_i\psi_i$, we have $\langle T\psi_i,T\psi_j\rangle=\lambda_i\langle\psi_i,\psi_j\rangle=\lambda_i\delta_{ij}$ so that

\[
\|Tf\|_{2,P}^2=\sum_{i>0}\lambda_i\langle f,\psi_i\rangle^2.
\]